
Exploring and Tuning Large Language Models for Emotion Detection of Tourism Reviews

Anonymous submission
Abstract
Automatic emotion detection and classification have been receiving increasing attention in the Natural Language Processing area, and a variety of resources and methods have been investigated. In particular, with the increasing availability of large language models (LLMs), such as BERT, we have access to rich resources for automatic emotion analysis. However, generic language models need to be adapted and tuned for emotion analysis in a specific target domain for improved results. In this paper, we report on our experiments in which we test a set of existing tools and three LLMs for automatic emotion analysis of tourism reviews. Our experiments are based on a new tourism emotion corpus, named TORCE, which was built to address the lack of test datasets in the tourism domain. In our experiments, we focused on examining how LLM tuning affects the performance of tools based on LLMs. By tuning three LLMs using the augmented TORCE dataset, we improved the emotion classification results significantly over both the untuned LLMs and the other tools. This result provides strong evidence that, by applying suitable tuning methods to LLMs, we can potentially improve emotion detection and classification significantly.

Keywords: Emotion Detection, Large Language Model tuning, Tourism Reviews, Crowdsourcing, Natural Language
Processing

——————–
Some text here was deleted to focus more on the text that
needs to be included in Chapter 4.
——————–

We also observed which emotion categories the MTurk workers tend to assign to the same reviews, using the mutual information association metric over emotion category pairs. As a result, we found that the pairs ("Anger", "Disgust") and ("Joy", "Trust") show a strong association. One implication of such a strong association is that emotion categories that are next to each other in the Plutchik wheel of emotions have a stronger correlation. We also found that only about 1% of the reviews fall under the "Fear" category. Based on these observations, we decided to merge the emotion classes with a strong association into a single class for our experiments. Consequently, we used five emotion classes, namely "anger", "anticipation", "joy", "sadness", and "surprise", in our annotated emotion reviews dataset. The distribution of emotions in the TORCE dataset is shown in Figure 1. As shown in the figure, 49% of the TORCE data falls under the "joy" category, and the remaining 51% belongs to the other four categories.

[Figure 1: Distribution of emotions in the TORCE dataset.]
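For readers who want to reproduce this kind of check, the sketch below shows one way the pairwise association could be computed. It is our own illustrative reconstruction, not the TORCE pipeline: each review is represented by the set of emotion labels assigned by the annotators, and scikit-learn's mutual_info_score is applied to binary indicator vectors; the toy reviews and variable names are assumptions.

```python
from itertools import combinations
from sklearn.metrics import mutual_info_score

# Each review is represented by the set of emotion labels its annotators assigned.
# The reviews below are toy examples, not TORCE data.
review_labels = [
    {"Anger", "Disgust"},
    {"Joy", "Trust"},
    {"Joy"},
    {"Anger", "Disgust", "Sadness"},
    {"Surprise", "Joy", "Trust"},
]
emotions = ["Anger", "Anticipation", "Disgust", "Fear",
            "Joy", "Sadness", "Surprise", "Trust"]

# Binary indicator vector per emotion: 1 if that emotion was assigned to the review.
indicators = {e: [int(e in labels) for labels in review_labels] for e in emotions}

# Mutual information between every pair of emotion categories.
associations = {
    (a, b): mutual_info_score(indicators[a], indicators[b])
    for a, b in combinations(emotions, 2)
}

# Print the most strongly associated pairs first.
for (a, b), mi in sorted(associations.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{a:12s} {b:12s} MI = {mi:.3f}")
```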
1. Testing Existing Emotion Detection Tools for Tourism Reviews

With our focus on the tourism domain, we wanted to test how the existing emotion detection tools perform on our TORCE test data. We also wanted to use these tools as the baseline for evaluating LLMs.
Our survey shows that a number of emotion detection tools are available, employing different emotion classification schemes, techniques and resources. For our experiment, we chose the tools based on the following criteria:

1. The tool is linked to one or more published papers, allowing us to formally cite the work.

2. The tool is publicly available.

3. The tool employs the full Plutchik emotion scheme or a subset of it, because TORCE is annotated with this scheme.

As a result, we selected four emotion detection tools for our experiment:
1. LeXmo (https://github.com/dinbav/LeXmo): a Python package based on the NRC Emotion Lexicon (Mohammad and Turney, 2013). The lexicon is a database of word-emotion associations created using a crowdsourcing approach. It contains over 14,000 words which are mapped to eight emotions: anger, disgust, fear, joy, sadness, surprise, trust, and anticipation. It has been used to improve the performance of a wide range of NLP tasks, including sentiment analysis, sarcasm detection, and emotion detection.

2. EmoNet (https://github.com/UBC-NLP/EmoNet): a fine-grained emotion detection tool based on Gated Recurrent Neural Networks (GRNNs), implemented by Abdul-Mageed and Ungar (2017). GRNNs are a type of neural network well suited for tasks such as emotion detection, as they are able to learn long-term dependencies in text data. EmoNet is trained on a massive dataset of labeled text examples, which includes over a million examples from a variety of sources, such as social media, news articles, and movie reviews. This diversity of data allows EmoNet to learn the nuances of human emotion and to accurately detect a wide range of emotions, even in complex and challenging contexts.

3. pysentimiento (https://github.com/pysentimiento/pysentimiento): a Python toolkit that aims to provide access to state-of-the-art large language models for sentiment analysis and social NLP tasks (Pérez et al., 2021). Multiple language models were tested on a multilingual emotion dataset labelled with six basic emotions, including "anger", "disgust", "fear", "joy", "sadness" and "surprise".

4. ETT (https://github.com/adapter-hub/efficient-task-transfer): Poth et al. (2021) propose a method for efficiently selecting intermediate tasks that can improve the performance of a variety of NLP tasks. Their method is based on the observation that embedding-based methods, which rely solely on the respective datasets, outperform computationally expensive few-shot fine-tuning approaches. They evaluated their method on a diverse set of 42 intermediate and 11 target English classification, multiple choice, question answering, and sequence tagging tasks. Emotion detection is one of the wide range of tasks they present. The model was trained on six emotions, including "anger", "love", "fear", "joy", "sadness" and "surprise".
The tools described above were evaluated using TORCE as the test dataset. We used the widely used F-score as the evaluation metric. Because the TORCE dataset is imbalanced in terms of the proportion of data from each emotion category, our evaluation uses the support-weighted average of the per-class F-scores:

\mathrm{F\text{-}score}_{\mathrm{weighted}} = \frac{\sum_i \mathrm{support}_i \times F_i}{\sum_i \mathrm{support}_i} \quad (1)

where support_i is the number of reviews in class i, and F_i is the F-score for class i. By employing this weighted F-score, the emotion classes with a higher number of reviews in the dataset have a higher influence on the overall F-score. Table 1 shows the weight for each emotion class in our TORCE dataset.

Emotion class    Weight
Anger            25%
Sadness          10%
Joy              49%
Surprise         10%
Anticipation     6%

Table 1: The weight for each emotion class in the dataset.
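For completeness, the weighted F-score in Equation (1) is what scikit-learn computes with average="weighted". The tiny example below, using made-up gold and predicted labels rather than TORCE data, shows both the library call and the explicit support-weighted sum.

```python
from collections import Counter
from sklearn.metrics import f1_score

# Made-up gold and predicted labels over the five TORCE classes (illustrative only).
y_true = ["joy", "joy", "anger", "sadness", "joy", "surprise", "anticipation", "anger"]
y_pred = ["joy", "anger", "anger", "joy", "joy", "surprise", "joy", "anger"]

# Library call: support-weighted average of per-class F-scores (Equation 1).
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))

# Explicit computation, mirroring Equation (1).
classes = sorted(set(y_true))
per_class_f1 = f1_score(y_true, y_pred, labels=classes, average=None)
support = Counter(y_true)
weighted = sum(support[c] * f for c, f in zip(classes, per_class_f1)) / len(y_true)
print("by hand:    ", weighted)
```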
Table 2 shows the weighted average F-score for each of the tools under evaluation. As shown, pysentimiento has the best overall performance on the TORCE dataset. Figure 2 shows the F-scores of the detected emotion classes. It is worth mentioning that pysentimiento and ETT do not support the "anticipation" class. We can observe that the "joy" class is the easiest class to detect, whereas the remaining classes appear harder to detect on tourism review data.

Emotion Detection Tool    F-score
NRC (LeXmo)               0.47
EmoNet                    0.52
pysentimiento             0.67
ETT                       0.59

Table 2: Weighted average F-score.

[Figure 2: F-score for each emotion class.]
2. Tuning Language Models for Tourism Review Emotion Detection

Machine learning models are sensitive to the quality and quantity of the training data. If the labelled data contains errors or biases, the model can learn and perpetuate these mistakes. It is also difficult for machine learning to deal with imbalance in the data. For example, if a model is trained on an unbalanced emotion dataset, it is more likely to make mistakes for the minority emotion classes, because it has seen fewer examples of those classes during training and may not have learned to identify them effectively. There are techniques that can be used to mitigate the negative impact of an unbalanced training dataset, including oversampling, undersampling and weighted learning. By using these techniques, the performance of machine learning models on unbalanced datasets can be improved.

As shown in Figure 1, our tourism review dataset suffers from the scarcity of some emotion classes. Therefore, we wanted to test different techniques to oversample the scarce emotion classes in order to make them more comparable to the majority classes in the training dataset. For this purpose, we chose an augmentation approach based on the method suggested by Wei and Zou (2019), explained below (a simplified sketch of these operations follows the list):
1. Random Insertion (RI): Based on the context of the review and using the BERT language model, up to two words are added to each review.

2. Random Deletion (RD): Up to two words are randomly deleted per review.

3. Random Swapping (RS): Two neighbouring words are randomly swapped. This process is applied up to two times for each review.

4. Synonym Replacement (SR): Up to two words are randomly replaced with their synonyms using the BERT language model.

5. All augmentation methods combined: mixed use of the above-mentioned augmentation methods.
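The sketch below illustrates how such operations might look in code. It is our own simplified reconstruction, not the authors' implementation: random deletion and swapping are plain string operations, while random insertion uses a BERT fill-mask pipeline to propose a context-appropriate word, roughly in the spirit of the description above.

```python
import random
from transformers import pipeline

# BERT masked-LM used to propose context-appropriate insertions (illustrative choice).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def random_deletion(text, max_deletions=2):
    """Randomly delete up to `max_deletions` words from the review."""
    words = text.split()
    for _ in range(min(max_deletions, max(len(words) - 1, 0))):
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def random_swap(text, n_swaps=2):
    """Randomly swap two neighbouring words, up to `n_swaps` times."""
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def random_insertion(text, max_insertions=2):
    """Insert up to `max_insertions` words proposed by the masked language model."""
    words = text.split()
    for _ in range(max_insertions):
        pos = random.randrange(len(words) + 1)
        masked = " ".join(words[:pos] + [fill_mask.tokenizer.mask_token] + words[pos:])
        best = fill_mask(masked, top_k=1)[0]["token_str"]
        words.insert(pos, best)
    return " ".join(words)

review = "the hotel room was clean and the staff were friendly"
print(random_deletion(review))
print(random_swap(review))
print(random_insertion(review))
```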
In our experiments on tuning Large Language Models (LLMs), we tested three models: 1) BERT (Devlin et al., 2019) base uncased, trained on English Wikipedia and BookCorpus (Zhu et al., 2015); 2) DistilBert (Sanh et al., 2020) base uncased, trained on the same datasets as BERT but smaller and faster because it has fewer parameters; and 3) Roberta (Liu et al., 2019) base, trained on five datasets weighing 160GB of text combined. These models were selected from the Transformers package (https://github.com/huggingface/transformers).

Before tuning these models, we tested the original models for emotion classification, and they produced poor F-scores, as shown in Table 3. Such results were expected because the models are trained on generic large-scale data; they need to be tuned for specific tasks such as text classification (Devlin et al., 2019).

Language model    F-score
BERT              0.11
DistilBert        0.31
Roberta           0.32

Table 3: Weighted average F-score for the language models without fine-tuning.

With regard to tuning LLMs, a critical factor is choosing an optimal learning rate. The learning rate is a hyperparameter that controls the speed at which a machine-learning model updates its parameters. If the learning rate is too high, an LLM may not be able to fully learn the patterns in the training data and may have difficulty generalizing to new data. On the other hand, with a learning rate that is too low, an LLM may take a very long time to train and may not achieve the desired performance metrics. Therefore, it is important to run trials with different learning rates to find the optimal value. We tested a set of learning rates for each language model: 2e-5, 1e-5, 5e-6 and 1e-6. The batch size was 16, and each model was trained for up to 20 epochs.
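As a rough sketch of this setup, assuming the Hugging Face Trainer API, fine-tuning one of the three checkpoints on the five-class task could look as follows; the dataset contents, label list and output path below are placeholders, not the authors' actual code or data.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholders: the real experiment uses the (augmented) TORCE reviews.
labels = ["anger", "anticipation", "joy", "sadness", "surprise"]
train_texts = ["the view from the room was breathtaking",
               "the staff ignored our complaints all week"]
train_labels = [2, 0]  # indices into `labels`

model_name = "roberta-base"  # or "bert-base-uncased", "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=len(labels))

class ReviewDataset(torch.utils.data.Dataset):
    """Minimal dataset wrapping tokenized reviews and their label indices."""
    def __init__(self, texts, label_ids):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.label_ids = label_ids
    def __len__(self):
        return len(self.label_ids)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.label_ids[idx])
        return item

args = TrainingArguments(
    output_dir="torce-emotion",    # placeholder path
    learning_rate=1e-5,            # one of the tested values: 2e-5, 1e-5, 5e-6, 1e-6
    per_device_train_batch_size=16,
    num_train_epochs=20,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=ReviewDataset(train_texts, train_labels))
trainer.train()
```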
Table 4 shows the weighted average F-score of emotion classification on our tourism reviews dataset for the different LLMs, learning rates, and augmentation techniques.

Lang. Model   Learning Rate   Random Insertion   Random Deletion   Random Swapping   Synonym Replacement   All combined
BERT          2e-5            0.76               0.74              0.78              0.75                  0.78
BERT          1e-5            0.77               0.74              0.77              0.75                  0.77
BERT          5e-6            0.77               0.74              0.76              0.75                  0.77
BERT          1e-6            0.73               0.72              0.72              0.74                  0.72
DistilBert    2e-5            0.72               0.73              0.74              0.73                  0.74
DistilBert    1e-5            0.74               0.72              0.74              0.73                  0.72
DistilBert    5e-6            0.73               0.72              0.74              0.71                  0.72
DistilBert    1e-6            0.65               0.64              0.65              0.66                  0.65
Roberta       2e-5            0.78               0.79              0.77              0.79                  0.77
Roberta       1e-5            0.80               0.77              0.79              0.79                  0.79
Roberta       5e-6            0.77               0.77              0.77              0.79                  0.77
Roberta       1e-6            0.76               0.75              0.77              0.76                  0.75

Table 4: Weighted average F-score for BERT, DistilBert and Roberta.

As shown in the table, each of BERT, DistilBert and Roberta was tuned using the four augmentation techniques and their combination. In addition, four learning rates were tested for each model and tuning method. In the original table, the F-scores highlighted in blue and red indicate the best results for each learning rate.

As a result, the Roberta model with the random insertion tuning method and a 1e-5 learning rate produced the best result, an F-score of 0.80. Overall, the Roberta model produced consistently higher F-scores than the other two models, with promising results for all augmentation methods. Figure 3 shows its F-scores for the four learning rates across tuning methods, and Figure 4 shows its F-scores for the five emotion classes. In Figures 3, 5 and 6, the codes "RI", "RD", "RS", "SR" and "All combined" stand for "Random Insertion", "Random Deletion", "Random Swapping", "Synonym Replacement" and "All combined" respectively.

[Figure 3: The F-score for different learning rates with Roberta.]

[Figure 4: The F-score for each emotion class.]

On the other hand, the DistilBert model produced the worst F-scores; compared to the other models, it underperformed with all tuning methods. Relatively, the random swapping method appears to be the best augmentation method for training DistilBert. Figure 5 shows its F-scores for all learning rates across augmentation methods.

[Figure 5: The F-score for different learning rates with DistilBert.]

In this experiment, the BERT model tuned with the combined augmentation methods tended to produce relatively higher F-scores, with a best result of 0.78 at the learning rate of 2e-5. Overall, this model yielded moderate performance. Also, the learning rate seems to have little influence on the model when it is tuned with the synonym replacement method. Figure 6 shows the F-scores produced by this model for different learning rates across augmentation methods.

[Figure 6: The F-score for different learning rates with BERT.]

If we compare the results of emotion classification before and after language model tuning (Table 3 vs. Table 4), we see a drastic increase in F-scores. This showcases the positive impact of tuning the models.
When we compare Figure 2 and Figure 4, which show the best performance of the existing tools and of the LLMs for the individual emotion categories, i.e. using the best performance of the four existing tools as the baseline, we also see a significant improvement for four categories out of five. In detail, we observe F-score improvements of 0.16 for anger, 0.08 for joy, 0.01 for sadness, 0.23 for surprise, and 0.54 for anticipation. This result provides strong evidence that, if we apply suitable tuning methods to LLMs, we can expect to improve automatic emotion detection and classification significantly.

3. Conclusion

In this paper, we reported on our study in which we explored the performance of existing emotion detection tools and large language models (LLMs) for tourism reviews in social media. In particular, we examined the impact of language model tuning using augmentation techniques on the performance of emotion classification. The experiments are based on a new tourism emotion corpus named TORCE, which was compiled by collecting tourists' reviews from the tourism website TripAdvisor and manually annotating them with emotion information. Our experimental results demonstrate that LLM tuning has a positive impact on automatic emotion classification, and that with suitable tuning methods we can expect a significant improvement in the performance of tools based on LLMs. In future work, we will extend the study to a larger test dataset and more LLMs.

4. Bibliographical References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 718-728, Vancouver, Canada. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436-465.

Juan Manuel Pérez, Juan Carlos Giudici, and Franco Luque. 2021. pysentimiento: A Python toolkit for sentiment analysis and SocialNLP tasks.

Clifton Poth, Jonas Pfeiffer, Andreas Rücklé, and Iryna Gurevych. 2021. What to pre-train on? Efficient intermediate task selection.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books.
