
Should We Fine-Tune or RAG?
Evaluating Different Techniques to Adapt LLMs for Dialogue

Simone Alghisi†, Massimo Rizzoli†, Gabriel Roccabruna,
Seyed Mahed Mousavi, Giuseppe Riccardi
Signals and Interactive Systems Lab, University of Trento, Italy
{s.alghisi, massimo.rizzoli, giuseppe.riccardi}@unitn.it

† Equal contribution.

Abstract

We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types, and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama2C and MistralI, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in both scenarios of Retrieval-Augmented Generation (RAG) and gold knowledge. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.

1 Introduction

In recent years, Large Language Models (LLMs) have been employed for the task of response generation in human-machine dialogues (Hosseini-Asl et al., 2020a; Izacard and Grave, 2021; Komeili et al., 2022). Such models have been applied to several dialogue types, including Open-Domain Dialogues (i.e. informal conversations about trivial matters), Knowledge-Grounded Dialogues (i.e. conversations with a system that provides factual responses), Task-Oriented Dialogues (i.e. conversations where the system helps a user to achieve a specific goal), and Question Answering (i.e. question-answer exchanges given context).

However, recent studies have shown the shortcomings of LLMs as dialogue model surrogates, as they are prone to generate toxic, biased, and irrelevant responses (Zhang et al., 2020; Mousavi et al., 2022, 2023; Lin and Chen, 2023). To adapt LLMs to dialogue types, different techniques have been employed, such as in-context learning (Brown et al., 2020; Chen et al., 2023; Meade et al., 2023; Hudeček and Dusek, 2023) and fine-tuning (Wang et al., 2022; Komeili et al., 2022; Huang et al., 2023). Furthermore, strategies such as grounding (Gopalakrishnan et al., 2019; Zhao et al., 2023) and Retrieval-Augmented Generation (RAG) (Lewis et al., 2020; Borgeaud et al., 2022) have been proposed to improve the generation quality.

Currently, the performance of the aforementioned techniques in adapting LLMs across different dialogue types is understudied. Previous studies have evaluated these techniques in a specific dialogue type only (Raposo et al., 2023; Zhang et al., 2023). Such studies are based on different base models and are assessed via incomparable evaluation methodologies.

In this work, we conduct an extensive study on the efficacy of different techniques to adapt LLMs for multiple dialogue types. We select Llama-2 Chat (Llama2C) (Touvron et al., 2023) and Mistral Instruct (MistralI) (Jiang et al., 2023) as base LLMs, and experiment with in-context learning and fine-tuning in the context of four dialogue types: a) Open-Domain Dialogues (ODDs), b) Knowledge-Grounded Dialogues (KGDs), c) Task-Oriented Dialogues (TODs), d) Question Answering (QA). Besides, we assess the impact of incorporating external knowledge by considering retrieved knowledge
and gold knowledge. In the retrieved knowledge scenario, we use RAG to add the knowledge to the model's input. We assess the performance of each technique using the same automatic metrics and comparable human evaluation. We further compute the contribution of each segment of the input vector by using integrated gradients as an explainability attribution method. We evaluate the models using an open human evaluation protocol (Mousavi et al., 2022) designed for dialogue contextualization, appropriateness, correctness, and validity. In summary, the main contributions of this paper are:

• Adaptation of Llama2C and MistralI using fine-tuning and in-context learning (the code is available at https://github.com/sislab-unitn/Fine-Tune-or-Rag) in four different dialogue types and corresponding corpora;

• Assessment of the impact of grounding the response generation on external knowledge, both in cases of retrieved knowledge and gold knowledge;

• Extensive study on the efficacy of each technique using automatic evaluations and human evaluation, including explainability and categorization analysis of natural language generation errors.

2 Literature Review

Open-Domain Dialogue (ODD) In earlier studies, sequence-to-sequence models have been trained for response generation in open-domain dialogues (Li et al., 2017). However, such models suffered from generating generic or inappropriate responses (Zhang et al., 2020). To improve the generation quality, studies grounded the generation on external knowledge, such as persona statements (Wolf et al., 2019; Kasahara et al., 2022; Xu et al., 2022b), the personal graph of user interactions (Mousavi et al., 2023), and retrieved documents (Huang et al., 2023). While these previous works developed data-driven models using training/fine-tuning, recent studies have explored the potential of in-context learning with LLMs (Qian et al., 2023).

Knowledge-Grounded Dialogue (KGD) Sources such as Wikipedia have been used as unstructured knowledge to ground the generated responses (Dinan et al., 2019; Gopalakrishnan et al., 2019; Komeili et al., 2022) to generate consistent and factual answers. To improve the generation quality, previous works have studied the impact of knowledge selection (Qin et al., 2023; Sun et al., 2023), different knowledge representations (Mousavi et al., 2023; Yang et al., 2023), additional knowledge elements (e.g. dialogue acts, topics) (Hedayatnia et al., 2020), training without knowledge supervision (Han et al., 2023), and in-context learning (Chen et al., 2023).

Task-Oriented Dialogue (TOD) LLMs have been fine-tuned for TOD modeling for joint dialogue state tracking and response generation (Hosseini-Asl et al., 2020b; Kulhánek et al., 2021; Wang et al., 2022; Ding et al., 2024), and for robustness to spoken interactions (Thulke et al., 2024; Mousavi et al., 2024). Recent studies focus on augmenting TOD modeling with unstructured knowledge access (Feng et al., 2020; Kim et al., 2020, 2021). In this regard, He et al. (2024) have proposed a pipeline for retrieval and grounded response generation. Raposo et al. (2023) compared in-context learning and fine-tuning, but considered retrieved replies from previous dialogues as knowledge.

Question Answering (QA) In the most general setting, relevant documents need to be retrieved to provide an answer (Lee et al., 2019; Qu et al., 2020). Some studies have proposed to select the documents with the highest similarity to the question, computed between their BERT encodings (Lee et al., 2019; Karpukhin et al., 2020). With this retrieval strategy, some studies have fine-tuned LLMs to condition the generation on the retrieved documents through grounding (Lewis et al., 2020; Izacard and Grave, 2021) or cross-attention (Borgeaud et al., 2022). Other works generated the answers using zero-shot in-context learning (Levine et al., 2022; Cho et al., 2023). A survey compared existing generation-only, retrieval-only, and RAG models (Zhang et al., 2023) but with different base models, hindering the comparison of the techniques.

3 Experiments

We study and compare in-context learning and fine-tuning as techniques to adapt LLMs for human-machine dialogues. We select Llama-2 Chat (Llama2C) (Touvron et al., 2023) and Mistral Instruct (MistralI) (Jiang et al., 2023) as base LLMs, and experiment in the context of four dialogue types: Open-Domain Dialogue (ODD), Knowledge-Grounded Dialogue (KGD), Task-Oriented Dialogue (TOD), and Question Answering (QA).
For each technique and dialogue type, we assess the impact of grounding the generation on documents in the scenarios of retrieved knowledge (RAG) and gold knowledge.

3.1 Datasets

In our experiment, we have selected a dataset for each of the four dialogue types (see §A.1 for the selection criteria). The statistics of these datasets are summarized in Table 1.

| Type | Dataset       | #Dials | Avg. #Turns | #Ext. Know. |
|------|---------------|--------|-------------|-------------|
| ODD  | DailyDialog   | 13k    | 8           | —           |
| KGD  | WoW           | 20k    | 9           | 61†         |
| TOD  | DSTC9 Track 1 | 9k     | 19          | 2900        |
| QA   | NarrativeQA*  | 47k    | 2           | 1572        |

Table 1: Selected datasets for each dialogue type: Open-Domain Dialogue (ODD), Knowledge-Grounded Dialogue (KGD), Task-Oriented Dialogue (TOD), and Question Answering (QA). #Ext. Know. indicates the number of documents in the unstructured knowledge base. † In KGD the content of the knowledge base differs at each turn, with an average of 61 ± 22 documents. * Question-answer exchanges.

Open-Domain Dialogue (ODD) We select DailyDialog (Li et al., 2017), a widely-used dataset of human-human dialogues crawled from various websites used by English learners to practice. The final dataset contains 13k written dialogues with an average of 8 turns per dialogue.

Knowledge-Grounded Dialogue (KGD) We experiment on Wizard of Wikipedia (Dinan et al., 2019), a dataset of dialogues between two participants with the roles of apprentice and wizard. At each turn, the wizard can access a set of documents (passages from Wikipedia) and use it to incorporate factual knowledge in their reply. The dataset contains 20k dialogues about one of 1359 distinct topics and provides an unseen set of documents for testing.

Task-Oriented Dialogue (TOD) We select the dataset proposed for the first track of the ninth Dialogue System Technology Challenge (Kim et al., 2020), an augmented version of MultiWOZ 2.1 (Eric et al., 2020). The dataset spans 7 domains and contains 9k multi-domain dialogues. The dialogues include turns where the system needs to access an unstructured knowledge base of 2900 documents (FAQs) to provide a correct response.

Question Answering (QA) We select NarrativeQA (Kočiský et al., 2018), a dataset of 47k questions with free-form answers based on 1.5k books and movie scripts. The question-answer pairs are formulated based on summaries of the books and movies.

3.2 Techniques

We evaluate in-context learning and fine-tuning as techniques to adapt LLMs for response generation in the selected dialogue types. In-context learning uses instructions and examples to condition the generation. Fine-tuning, instead, further trains the model (completely or partially) on the task of interest, using a smaller-scale dataset than in the pre-training phase. In a dialogue setting, fine-tuning should teach LLMs to behave as dialogue models and to account for each state of the conversation between speakers.

As a baseline, for both techniques, we consider the context (i.e. the question for QA, the history for ODD, KGD, and TOD) as the input, and use the default prompt structure of the models to separate user and system turns. Additionally, for TOD we append the dialogue state (a summary of user requirements), following previous work on this dialogue type (Wang et al., 2022; Ding et al., 2024). For KGD, we prepend the topic to the start of the dialogue.

3.3 Knowledge

Incorporating external knowledge for the task of response generation has been shown to improve the factual accuracy (He et al., 2024) and contextualization (Mousavi et al., 2023) of responses.

For each of the selected types except ODD, we consider the corresponding unstructured knowledge base. For KGD, we consider passages from Wikipedia, while for TOD we consider FAQs related to services and places (e.g. restaurants, hotels, taxi booking). For QA we consider all the summaries of the books and movies.

For both in-context learning and fine-tuning, we study the impact of knowledge on the generated responses in two scenarios:

• Retrieved knowledge: we retrieve k documents from the unstructured knowledge base;

• Gold knowledge: we use the ground truth document.

For the retrieved knowledge scenario, we use the Retrieval-Augmented Generation (RAG) strategy. We use an off-the-shelf retriever, built with LangChain (https://github.com/langchain-ai/langchain; model details in §A.2), to retrieve documents from the unstructured knowledge base. First, we encode all the documents, considering their content together with their topic (KGD), place or service name (TOD), or title (QA) (Karpukhin et al., 2020). Then, at each turn, we retrieve the k most similar documents based on the L2 distance to the encoded context. Finally, we feed the retrieved documents to the base models together with the context to generate a response.
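To make the retrieval step concrete, the following is a minimal sketch using the components named in §A.2 (an all-mpnet-base-v2 encoder and a FAISS index with exact L2 search); the toy knowledge base, the variable names, and the prompt layout are our own illustrative assumptions, not the paper's exact code:

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Toy knowledge base of (title/topic, content) pairs; the real experiments use
# Wikipedia passages (KGD), FAQs (TOD), or book/movie summaries (QA).
knowledge_base = [
    ("Colosseum", "The Colosseum is an ancient amphitheatre in the centre of Rome."),
    ("Pizza", "Pizza is an Italian dish of a flattened bread base with toppings."),
    ("Trento", "Trento is a city in northern Italy."),
]
dialogue_context = "User: I am visiting Rome next week. What should I see?"

# Encode each document together with its topic/name/title (see Section 3.3).
encoder = SentenceTransformer("all-mpnet-base-v2")
docs = [f"{title}: {content}" for title, content in knowledge_base]
doc_emb = encoder.encode(docs).astype(np.float32)

# Build an exact L2 index over the document embeddings.
index = faiss.IndexFlatL2(doc_emb.shape[1])
index.add(doc_emb)

# At each turn, retrieve the k most similar documents to the encoded context.
query_emb = encoder.encode([dialogue_context]).astype(np.float32)
_, ids = index.search(query_emb, k=3)  # top-3, as in the retrieved-knowledge runs
retrieved = [docs[i] for i in ids[0]]

# The retrieved documents are fed to the base model together with the context.
prompt = "Knowledge: " + " ".join(retrieved) + "\n" + dialogue_context
```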
In the gold knowledge scenario, we directly feed the model with the ground truth documents. This serves as an upper bound for RAG. Additionally, this strategy allows us to study the ability of the techniques to incorporate knowledge in the responses.

3.4 Models

We select the widely-used 7B versions of Llama2C and MistralI as base models. For in-context learning, we experiment with three instructions for each dialogue type and select the best based on the development set performance. For fine-tuning, we use LoRA, a parameter-efficient technique that has shown performance comparable to fine-tuning all parameters (Hu et al., 2021). Further details about the parameters are reported in §A.2.

4 Evaluation

We conduct a comparative study on the impact of in-context learning and fine-tuning to adapt LLMs for dialogues. We select Llama2C and MistralI as base LLMs and experiment in four dialogue types: ODDs, KGDs, TODs, and QA. For each dialogue type, we study the impact of external knowledge, both retrieved and gold. Further details about the implementation and the resources used are available in the appendix (§A.2).

4.1 Automatic Evaluation

Currently available automatic metrics for the task of response generation are not interpretable and correlate poorly with human judgments (Liu et al., 2016; Sai et al., 2022; Mousavi et al., 2022). Therefore, we focus on perplexity, as it is derived from the objective function used to fine-tune the models, and present other metrics in §A.3.

Table 2 reports the perplexity of Llama2C and MistralI on the test set of each dialogue type.

| Model    | Technique           | External Knowledge | ODD         | KGD         | TOD         | QA           |
|----------|---------------------|--------------------|-------------|-------------|-------------|--------------|
| Llama2C  | In-Context Learning | No Know.           | 64.13       | 35.17       | 25.15       | 1442.26      |
| Llama2C  | In-Context Learning | Retrieved Know.    | —           | 33.10       | 24.72       | 625.08       |
| Llama2C  | In-Context Learning | Gold Know.         | —           | 24.40       | 23.81       | 298.16       |
| Llama2C  | Fine-Tuning         | No Know.           | 5.67 ± 0.01 | 7.63 ± 0.01 | 3.06 ± 0.01 | 12.03 ± 0.06 |
| Llama2C  | Fine-Tuning         | Retrieved Know.    | —           | 6.95 ± 0.01 | 3.97 ± 0.01 | 5.47 ± 0.02  |
| Llama2C  | Fine-Tuning         | Gold Know.         | —           | 4.38 ± 0.01 | 3.12 ± 0.01 | 4.98 ± 0.01  |
| MistralI | In-Context Learning | No Know.           | 14.19       | 15.31       | 9.82        | 91.42        |
| MistralI | In-Context Learning | Retrieved Know.    | —           | 14.75       | 9.76        | 42.58        |
| MistralI | In-Context Learning | Gold Know.         | —           | 9.81        | 9.37        | 16.74        |
| MistralI | Fine-Tuning         | No Know.           | 6.41 ± 0.01 | 8.67 ± 0.01 | 3.56 ± 0.01 | 14.11 ± 0.01 |
| MistralI | Fine-Tuning         | Retrieved Know.    | —           | 7.78 ± 0.01 | 3.61 ± 0.01 | 5.97 ± 0.01  |
| MistralI | Fine-Tuning         | Gold Know.         | —           | 5.17 ± 0.01 | 3.58 ± 0.01 | 4.88 ± 0.01  |

Table 2: Automatic Evaluation. Perplexity of Fine-Tuning and In-Context Learning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, in different dialogue types: Open-Domain Dialogues (ODDs), Knowledge-Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA). Results for fine-tuned models report mean and standard deviation over three runs.

In all dialogue types, fine-tuned models have obtained better performance compared to in-context learning. When considering the impact of external knowledge, models fine-tuned on TODs show that knowledge slightly increases perplexity. The high perplexity obtained by in-context learning models on QA can be explained by two reasons: first, besides the knowledge, only the question is used as context; second, while the ground truths are particularly short (4.26 tokens on average), these models generate long responses, making them unlikely to include the correct answer in the first few tokens. This does not happen for fine-tuned models, since they are trained to generate shorter responses. Nevertheless, the best results have been obtained with gold knowledge. We report automatic evaluation results, including retriever accuracy, the overlap between knowledge and response tokens, and other automatic metrics, in §A.3.
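Perplexity here is the exponential of the average token-level negative log-likelihood of the ground-truth response under the model. A minimal sketch of this computation with Hugging Face transformers follows; the checkpoint name and the example dialogue are illustrative placeholders, not the paper's evaluation code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

context = "User: Where is the Colosseum?\nAssistant:"
response = " The Colosseum is in Rome."

ctx_ids = tokenizer(context, return_tensors="pt").input_ids
full_ids = tokenizer(context + response, return_tensors="pt").input_ids

# Score only the response tokens: positions belonging to the context are
# masked with -100 so they are ignored by the cross-entropy loss.
labels = full_ids.clone()
labels[:, : ctx_ids.shape[1]] = -100

with torch.no_grad():
    loss = model(full_ids, labels=labels).loss  # mean NLL over response tokens
perplexity = torch.exp(loss).item()
print(f"PPL = {perplexity:.2f}")
```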
4.1.1 Explainability Study

To understand the contribution of each segment of the input vector (i.e. instruction, context, knowledge, topic, and dialogue state), we compute integrated gradients over the input elements with the Inseq toolkit (Sarti et al., 2023) and select the most contributing input tokens (top-25%). Table 3 reports the percentage of most contributing tokens that fall in each segment (normalized by the length of the segment).

| Model    | Dialogue Type | Technique           | Instruction | Topic/Dialogue State | Dialogue History | Knowledge |
|----------|---------------|---------------------|-------------|----------------------|------------------|-----------|
| Llama2C  | KGD           | In-Context Learning | 21.85       | 28.60                | 15.97            | 33.58     |
| Llama2C  | KGD           | Fine-Tuning         | —           | 39.43                | 13.80            | 46.77     |
| Llama2C  | TOD           | In-Context Learning | 25.98       | 19.54                | 16.46            | 38.02     |
| Llama2C  | TOD           | Fine-Tuning         | —           | 27.19                | 8.04             | 64.77     |
| MistralI | KGD           | In-Context Learning | —           | 69.01                | 14.89            | 16.10     |
| MistralI | KGD           | Fine-Tuning         | —           | 65.55                | 11.00            | 23.45     |
| MistralI | TOD           | In-Context Learning | 69.05       | 10.19                | 11.24            | 9.52      |
| MistralI | TOD           | Fine-Tuning         | —           | 14.55                | 29.06            | 56.39     |

Table 3: Explainability Study. Percentage of tokens with a significant contribution to the generation in different segments of the input vector for each model in Knowledge-Grounded Dialogues (KGDs) and Task-Oriented Dialogues (TODs). All rows sum to 100. For KGD, the Topic/Dialogue State column reports the contribution of the Topic, while for TOD it reports the contribution of the Dialogue State. The Instruction segment is only present for In-Context Learning.

In general, in both KGD and TOD, the dialogue history is the least contributing segment, which might indicate that only a part of the history is significant for response generation. On the other hand, in KGD the topic has a higher score than the dialogue history, suggesting its importance for response generation in this dialogue type. Interestingly, MistralI gives considerably more importance to the topic than Llama2C, decreasing the importance of the knowledge segment. For the TOD type, the most contributing segment is often the knowledge, reaching over 50% with fine-tuning. This suggests that knowledge is more relevant for TOD, and that this relevance changes with respect to the dialogue type.
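A minimal sketch of this attribution step with Inseq follows; the small model and toy input are stand-ins for the paper's 7B models and full prompts, and the aggregation of per-token scores into per-segment percentages (with top-25% token selection) is omitted:

```python
import inseq

# Load a causal LM wrapped with the integrated-gradients attribution method.
# "gpt2" is a lightweight stand-in for the 7B models used in the paper.
model = inseq.load_model("gpt2", "integrated_gradients")

# Attribute the generated response to the input segments
# (instruction, topic/dialogue state, dialogue history, knowledge).
out = model.attribute(
    input_texts="Topic: pizza\nUser: What should I eat tonight?\nAssistant:",
)
out.show()  # per-token attribution scores; aggregate per segment offline
```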
4.2 Human Evaluation

Considering the uninterpretability of automatic evaluations, we conducted a human evaluation of the generated responses to gain more insight into the models' performance. Mousavi et al. (2022) proposed four dimensions to evaluate response generation based on the most common errors and qualities. We evaluate the responses using their protocol and three of their dimensions:

• Contextualization: the response includes explicit or implicit references to the dialogue history (ODD, KGD, TOD) or the gold knowledge (QA);

• Appropriateness: the response is coherent and makes sense as a continuation of the dialogue;

• Correctness: the response is grammatically and syntactically correct.

According to these dimensions, we evaluate the responses for all techniques, models, and knowledge scenarios, in all dialogue types. The only exception is QA, where we do not evaluate "Appropriateness", since this dimension considers coherence with respect to a dialogue history, while QA only has question-answer exchanges. Instead, we extend the protocol (the extended version is available at https://github.com/sislab-unitn/Human-Evaluation-Protocol/tree/v1.1) by proposing a new dimension for QA:

• Validity: the response includes adequate information to answer the question.

For TOD we do not include a dimension to evaluate whether the response is in line with the user requirements, as this can be measured automatically (via dialogue state tracking metrics, e.g., Joint Goal Accuracy). The dimensions can have either a positive or a negative answer value, as well as "I don't know", to avoid forcing erroneous judgments on either side. For "Contextualization" and "Appropriateness", we also ask the annotators to motivate the negative judgments with the explanations proposed in the original protocol. We present the explanations and the related results in §4.3.

We recruited 75 annotators on the Prolific platform (https://www.prolific.com/), and we assigned 5 dialogues to each annotator. After performing quality control, we approved 65 annotators with a compensation of 9.00£/hour (marked as good on the Prolific platform). Due to the large number of responses, each annotator evaluated a different set of model responses for a given dialogue. For the purpose of quality control, for each dialogue type, two dialogues were overlapping among five annotators, while the remaining dialogues were annotated by one crowd-worker with an overlap only on the ground truth. The inter-annotator agreement measured with Fleiss' κ (Fleiss, 1971) was 0.65 (substantial agreement).
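The reported agreement can be reproduced with statsmodels; the following small sketch uses invented toy judgments purely for illustration:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy data: 6 responses (rows) judged by 5 raters (columns),
# with labels in {0: negative, 1: positive, 2: "I don't know"}.
ratings = np.array([
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [2, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
])

# aggregate_raters turns (items x raters) labels into (items x categories) counts,
# the input format expected by fleiss_kappa.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")
```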
As results of the human evaluation (Table 4), we report the percentage of positively judged responses (Contextualized, Appropriate, Valid) for Llama2C and MistralI when considering different adaptation techniques (Fine-Tuning and In-Context Learning) and knowledge (No Knowledge, Retrieved Knowledge, and Gold Knowledge) across different dialogue types. As for ODDs, we report no results for the Retrieved and Gold Knowledge scenarios, since no knowledge was used for this dialogue type. Additional results on "Correctness" are reported in §A.4.

| Model        | Technique           | External Knowledge | Ctx. ODD | Ctx. KGD | Ctx. TOD | Ctx. QA | App. ODD | App. KGD | App. TOD | Val. QA |
|--------------|---------------------|--------------------|----------|----------|----------|---------|----------|----------|----------|---------|
| Llama2C      | In-Context Learning | No Know.           | 85       | 70       | 70       | 50      | 80       | 70       | 60       | 10      |
| Llama2C      | In-Context Learning | Retrieved Know.    | —        | 75       | 65       | 70      | —        | 75       | 45       | 35      |
| Llama2C      | In-Context Learning | Gold Know.         | —        | 90       | 40       | 90      | —        | 85       | 45       | 80      |
| Llama2C      | Fine-Tuning         | No Know.           | 45       | 60       | 70       | 15      | 50       | 65       | 60       | 15      |
| Llama2C      | Fine-Tuning         | Retrieved Know.    | —        | 65       | 90       | 45      | —        | 80       | 80       | 45      |
| Llama2C      | Fine-Tuning         | Gold Know.         | —        | 80       | 85       | 85      | —        | 65       | 85       | 75      |
| MistralI     | In-Context Learning | No Know.           | 90       | 80       | 70       | 20      | 85       | 85       | 65       | 20      |
| MistralI     | In-Context Learning | Retrieved Know.    | —        | 75       | 65       | 40      | —        | 65       | 60       | 25      |
| MistralI     | In-Context Learning | Gold Know.         | —        | 90       | 55       | 75      | —        | 70       | 55       | 80      |
| MistralI     | Fine-Tuning         | No Know.           | 55       | 90       | 85       | 25      | 55       | 80       | 80       | 20      |
| MistralI     | Fine-Tuning         | Retrieved Know.    | —        | 95       | 85       | 30      | —        | 85       | 90       | 40      |
| MistralI     | Fine-Tuning         | Gold Know.         | —        | 80       | 75       | 70      | —        | 65       | 70       | 70      |
| Ground-Truth |                     |                    | 95       | 80       | 95       | 90      | 100      | 85       | 95       | 90      |

Table 4: Human Evaluation. Percentage of Contextualized (Ctx.), Appropriate (App.; ODD, KGD, TOD), and Valid (Val.; QA) responses for In-Context Learning and Fine-Tuning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, in different dialogue types: Open-Domain Dialogues (ODDs), Knowledge-Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA).

Open-Domain Dialogue (ODD) Models fine-tuned for ODD tend to generate considerably less contextualized responses than models adapted using in-context learning. In particular, fine-tuning Llama2C reduces contextualization by 40%, while for MistralI by 35%. Similarly, fine-tuning reduces their appropriateness by 30% compared to the in-context learning versions. This contrasts with the automatic evaluation (Table 2), where in-context learning obtained a higher perplexity (i.e. worse results) compared to fine-tuning.

Knowledge-Grounded Dialogue (KGD) Concerning KGD, the results are model-dependent. When considering Llama2C, in-context learning provides, regardless of the knowledge, 10% more contextualized responses compared to fine-tuning. On the other hand, fine-tuning MistralI on Retrieved Knowledge leads to the highest contextualization (95%). However, using Gold instead of Retrieved Knowledge reduces the contextualization of the fine-tuned model by 15%. Furthermore, when considering the best models, Llama2C and MistralI have a higher contextualization than the ground truth (by 10 to 15%), suggesting that the models copy more from the dialogue history. Similarly to contextualization, adapting Llama2C with in-context learning and Gold Knowledge provides the highest percentage of appropriate responses (85%). Instead, fine-tuning (on Retrieved Knowledge) or adapting MistralI with in-context learning (using No Knowledge) provides comparable appropriateness (85%). While according to the automatic evaluation (Table 2) fine-tuning is always the best technique, the human evaluation results show comparable appropriateness and contextualization for in-context learning and fine-tuning.

Task-Oriented Dialogue (TOD) When adapting Llama2C and MistralI to TOD, the results clearly show that fine-tuning is preferable over in-context learning. In particular, if we consider the best model for each technique, fine-tuned Llama2C generates 20% more contextualized responses, while MistralI generates 15% more. Although fine-tuned models benefit from external knowledge, Retrieved and Gold Knowledge visibly reduce the contextualization of in-context learning models (by at most 30% for Llama2C and 15% for MistralI). Similar behavior can be observed for in-context learning in terms of appropriateness, where Gold Knowledge reduces Llama2C results by 15% and MistralI by 10%. This is in line with the explainability study (Table 3), where models adapted with in-context learning have a lower contribution from the knowledge segment than their fine-tuned versions. In general, if we consider the best models for each technique, fine-tuned models generate 25% more appropriate responses.

Question Answering (QA) In QA, the results show improved contextualization and validity when including knowledge, with the best results obtained with gold knowledge. When considering the best model for each technique, in-context learning increases the percentage of contextualized responses by 5%. These results greatly differ from Table 2 and show how unreliable automatic evaluation can be. Although models fine-tuned on No or Retrieved Knowledge obtain comparable or higher validity than in-context learning, adding Gold Knowledge to adapt Llama2C and MistralI with in-context learning increases their validity by 5% and 10%, respectively. Finally, even with Gold Knowledge, no model reaches the validity of the ground truth (90%).

These findings indicate that the best technique depends on the dialogue type and the base LLM. Regarding the techniques, in-context learning leads to more contextualized and appropriate responses in ODDs, while fine-tuning improves contextualization and appropriateness in TODs. Regarding the base LLMs, in KGDs adapting Llama2C with in-context learning leads to the best results, while MistralI benefits the most from fine-tuning. Furthermore, in QA the quality of the knowledge impacts contextualization and validity the most, while the adaptation techniques have a minor effect.
4.3 Explaining Negative Human Judgments

To better understand the shortcomings of the techniques, we investigate the motivations provided by the annotators to support their negative judgments. For each technique, we considered the scenario with gold external knowledge as the theoretical upper bound (except for ODDs, where no external knowledge is required). Following the original protocol, we consider two explanations for Not Contextualized responses:

• Generic: the response is generic or does not contain any reference (implicit or explicit) to the dialogue history (ODD, KGD, TOD) or the gold knowledge (QA);

• Hallucinated: the response is inconsistent with the information contained in the dialogue history (ODD, KGD, TOD) or the gold knowledge (QA).

Regarding Not Appropriate responses, the protocol proposes one explanation (as an alternative to a free-form explanation):

• Incoherent: the response is not coherent with the context.

To better characterize errors in TODs, we propose an additional explanation:

• Unhelpful: the response candidate is not helpful in fulfilling the user's request.

In this section, we report the percentage of negatively judged responses with a certain explanation out of all the responses.

Open-Domain Dialogue (ODD) In ODDs (Figure 1), fine-tuning causes the generation of a few generic responses, while for in-context learning none are present. Moreover, fine-tuned models generate around 30% more hallucinated responses and around 25% more incoherent responses.

[Figure 1 (plot omitted): Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, and Incoherent) (x-axis), for Llama2C and MistralI, adapted with In-Context Learning and Fine-Tuning in Open-Domain Dialogues (ODDs).]

Knowledge-Grounded Dialogue (KGD) In KGDs (Figure 2), fine-tuning causes the generation of a few generic responses. Regarding hallucinated responses, fine-tuning slightly reduces them for Llama2C but increases them for MistralI. Differently, fine-tuning slightly increases the incoherent responses for Llama2C, but has no impact for MistralI.

[Figure 2 (plot omitted): Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, and Incoherent) (x-axis), for Llama2C and MistralI, adapted with In-Context Learning and Fine-Tuning in Knowledge-Grounded Dialogues (KGDs).]

Task-Oriented Dialogue (TOD) For the TOD type (Figure 3), while fine-tuning has no impact on generic responses for MistralI, it reduces generic responses by 15% for Llama2C. For both models, fine-tuning reduces the number of hallucinated responses by 10% and improves coherence by around 20%. It further reduces unhelpful responses by 10% for Llama2C.

[Figure 3 (plot omitted): Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, Incoherent, and Unhelpful) (x-axis), for Llama2C and MistralI, adapted with In-Context Learning and Fine-Tuning in Task-Oriented Dialogues (TODs).]

Question Answering (QA) For the QA type (Figure 4), fine-tuned models generate more generic responses than models adapted with in-context learning. Instead, fine-tuning results in fewer hallucinated responses for Llama2C, although it has no effect for MistralI.

[Figure 4 (plot omitted): Percentage of LLM responses (y-axis) for each error type (Not Contextualized) and their explanation (Generic and Hallucinated) (x-axis), for Llama2C and MistralI, adapted with In-Context Learning and Fine-Tuning in Question Answering (QA).]
5 Conclusion

We have conducted an extensive analysis of the efficacy of fine-tuning and in-context learning to adapt LLMs for different dialogue types. We have experimented with Retrieval-Augmented Generation (RAG) and gold knowledge to assess the impact of grounding the response generation on external knowledge. We have studied the models' performance using consistent criteria in both automatic (perplexity, explainability studies) and human evaluations.

Our study highlights the limitations of currently available automatic metrics and the necessity of conducting human evaluations to advance human-machine dialogue research, as the evaluations by human judges correlate poorly with automatic metrics. Furthermore, the conducted human evaluations indicate that there is no universal best technique for adapting LLMs to a dialogue type, and the performance of each technique depends on the base LLM as well as the dialogue type. In addition, the correct incorporation of external knowledge depends on various factors, such as the retriever accuracy, the representation of the knowledge, and the presence of noisy (non-gold) documents, as knowledge can be the least contributing element in the input vector according to the explainability studies.

Limitations

Due to limited computational resources, we could only experiment with 7B models, which prevented us from validating our findings on larger models. Furthermore, the reproducibility of the human evaluation results may be subject to variability, due to possible differences in the set of crowd workers.

Acknowledgments

We acknowledge the support of the MUR PNRR project FAIR - Future AI Research (PE00000013) funded by the NextGenerationEU.

References

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit dataset. Proceedings of the International AAAI Conference on Web and Social Media, 14(1):830–839.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, pages 2206–2240. PMLR.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Qinyu Chen, Wenhao Wu, and Sujian Li. 2023. Exploring in-context learning for knowledge grounded dialog generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10071–10081, Singapore. Association for Computational Linguistics.

Sukmin Cho, Jeongyeon Seo, Soyeong Jeong, and Jong Park. 2023. Improving zero-shot reader by reducing distractions from irrelevant documents in open-domain question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3145–3157, Singapore. Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations (ICLR).

Zeyuan Ding, Zhihao Yang, Yinbo Qiao, and Hongfei Lin. 2024. KMC-TOD: Structure knowledge enhanced multi-copy network for task-oriented dialogue system. Knowledge-Based Systems, 293:111662.

Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh Kumar, Anuj Goyal, Peter Ku, and Dilek Hakkani-Tur. 2020. MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 422–428, Marseille, France. European Language Resources Association.

Song Feng, Hui Wan, Chulaka Gunasekara, Siva Patel, Sachindra Joshi, and Luis Lastras. 2020. doc2dial: A goal-oriented document-grounded dialogue dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8118–8128, Online. Association for Computational Linguistics.
Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

J.J. Godfrey, E.C. Holliman, and J. McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. In Proc. Interspeech 2019, pages 1891–1895.

Gunsoo Han, Daejin Jo, Daniel Nam, Eunseop Yoon, Taehwan Kwon, Seungeun Rho, Kyoung-Woon On, Chang Yoo, and Sungwoong Kim. 2023. Efficient latent variable modeling for knowledge-grounded dialogue generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2683–2702, Singapore. Association for Computational Linguistics.

Huang He, Hua Lu, Siqi Bao, Fan Wang, Hua Wu, Zheng-Yu Niu, and Haifeng Wang. 2024. Learning to select external knowledge with multi-scale negative sampling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:714–720.

Behnam Hedayatnia, Karthik Gopalakrishnan, Seokhwan Kim, Yang Liu, Mihail Eric, and Dilek Hakkani-Tur. 2020. Policy-driven neural response generation for knowledge-grounded dialog systems. In Proceedings of the 13th International Conference on Natural Language Generation, pages 412–421, Dublin, Ireland. Association for Computational Linguistics.

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020a. A simple language model for task-oriented dialogue. In Advances in Neural Information Processing Systems, volume 33, pages 20179–20191. Curran Associates, Inc.

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020b. A simple language model for task-oriented dialogue. In Advances in Neural Information Processing Systems, volume 33, pages 20179–20191. Curran Associates, Inc.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, and Lilian Tang. 2023. Learning retrieval augmentation for personalized dialogue generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2523–2540, Singapore. Association for Computational Linguistics.

Vojtěch Hudeček and Ondrej Dusek. 2023. Are large language models all you need for task-oriented dialogue? In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 216–228, Prague, Czechia. Association for Computational Linguistics.

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. Association for Computational Linguistics.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.

Tomohito Kasahara, Daisuke Kawahara, Nguyen Tung, Shengzhe Li, Kenta Shinzato, and Toshinori Sato. 2022. Building a personalized dialogue system with prompt-tuning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pages 96–105, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.

Seokhwan Kim, Mihail Eric, Karthik Gopalakrishnan, Behnam Hedayatnia, Yang Liu, and Dilek Hakkani-Tur. 2020. Beyond domain APIs: Task-oriented conversational modeling with unstructured knowledge access. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 278–289, 1st virtual meeting. Association for Computational Linguistics.

Seokhwan Kim, Yang Liu, Di Jin, Alexandros Papangelis, Karthik Gopalakrishnan, Behnam Hedayatnia, and Dilek Hakkani-Tür. 2021. "How robust r u?": Evaluating task-oriented dialogue systems on spoken conversations. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1147–1154.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. Internet-augmented dialogue generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8460–8478, Dublin, Ireland. Association for Computational Linguistics.

Jonáš Kulhánek, Vojtěch Hudeček, Tomáš Nekvinda, and Ondřej Dušek. 2021. AuGPT: Auxiliary tasks and data augmentation for end-to-end dialogue with pre-trained language models. In Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI, pages 198–210, Online. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.

Yoav Levine, Ori Ram, Daniel Jannai, Barak Lenz, Shai Shalev-Shwartz, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2022. Huge frozen language models as readers for open-domain question answering. In ICML 2022 Workshop on Knowledge Retrieval and Language Models.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. In Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023), pages 47–58, Toronto, Canada. Association for Computational Linguistics.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, and Dilek Hakkani-Tur. 2023. Using in-context learning to improve dialogue safety. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11882–11910, Singapore. Association for Computational Linguistics.

Seyed Mahed Mousavi, Simone Caldarella, and Giuseppe Riccardi. 2023. Response generation in longitudinal dialogues: Which knowledge representation helps? In Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023), pages 1–11, Toronto, Canada. Association for Computational Linguistics.

Seyed Mahed Mousavi, Gabriel Roccabruna, Simone Alghisi, Massimo Rizzoli, Mirco Ravanelli, and Giuseppe Riccardi. 2024. Are LLMs robust for spoken dialogues?

Seyed Mahed Mousavi, Gabriel Roccabruna, Michela Lorandi, Simone Caldarella, and Giuseppe Riccardi. 2022. Evaluation of response generation models: Shouldn't it be shareable and replicable? In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 136–147, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Yushan Qian, Weinan Zhang, and Ting Liu. 2023. Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6516–6528, Singapore. Association for Computational Linguistics.
Lang Qin, Yao Zhang, Hongru Liang, Jun Wang, and Zhenglu Yang. 2023. Well begun is half done: Generator-agnostic knowledge pre-selection for knowledge-grounded dialogue. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4696–4709, Singapore. Association for Computational Linguistics.

Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, pages 539–548, New York, NY, USA. Association for Computing Machinery.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Gonçalo Raposo, Luisa Coheur, and Bruno Martins. 2023. Prompting, retrieval, training: An exploration of different approaches for task-oriented dialogue generation. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 400–412, Prague, Czechia. Association for Computational Linguistics.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.

Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra. 2022. A survey of evaluation metrics used for NLG systems. ACM Computing Surveys, 55(2).

Gabriele Sarti, Nils Feldhus, Ludwig Sickert, and Oskar van der Wal. 2023. Inseq: An interpretability toolkit for sequence generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 421–435, Toronto, Canada. Association for Computational Linguistics.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Weiwei Sun, Pengjie Ren, and Zhaochun Ren. 2023. Generative knowledge selection for knowledge-grounded dialogues. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2077–2088, Dubrovnik, Croatia. Association for Computational Linguistics.

David Thulke, Nico Daheim, Christian Dugast, and Hermann Ney. 2024. Task-oriented document-grounded dialog systems by HLTPR@RWTH for DSTC9 and DSTC10. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:733–741.

Jörg Tiedemann. 2009. News from OPUS—A Collection of Multilingual Parallel Corpora with Tools and Interfaces, volume 5, pages 237–248.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Weizhi Wang, Zhirui Zhang, Junliang Guo, Yinpei Dai, Boxing Chen, and Weihua Luo. 2022. Task-oriented dialogue system as natural language generation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, pages 2698–2703, New York, NY, USA. Association for Computing Machinery.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Jing Xu, Arthur Szlam, and Jason Weston. 2022a. Beyond goldfish memory: Long-term open-domain conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5180–5197, Dublin, Ireland. Association for Computational Linguistics.
Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. 2022b. Long time no see! Open-domain conversation with long-term persona memory. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2639–2650, Dublin, Ireland. Association for Computational Linguistics.

Yizhe Yang, Heyan Huang, Yuhang Liu, and Yang Gao. 2023. Graph vs. sequence: An empirical study on knowledge forms for knowledge-grounded dialogue. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15846–15858, Singapore. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Qin Zhang, Shangsi Chen, Dongkuan Xu, Qingqing Cao, Xiaojun Chen, Trevor Cohn, and Meng Fang. 2023. A survey for efficient open domain question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14447–14465, Toronto, Canada. Association for Computational Linguistics.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278, Online. Association for Computational Linguistics.

Chao Zhao, Spandana Gella, Seokhwan Kim, Di Jin, Devamanyu Hazarika, Alexandros Papangelis, Behnam Hedayatnia, Mahdi Namazifar, Yang Liu, and Dilek Hakkani-Tur. 2023. "What do others think?": Task-oriented conversational modeling with subjective knowledge. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 309–323, Prague, Czechia. Association for Computational Linguistics.

Kangyan Zhou, Shrimai Prabhumoye, and Alan W Black. 2018. A dataset for document grounded conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 708–713, Brussels, Belgium. Association for Computational Linguistics.
A Appendix

A.1 Datasets

We briefly present the reasons for selecting the datasets.

Open-Domain Dialogue (ODD) Differently from other datasets, DailyDialog dialogues only involve two participants (Tiedemann, 2009; Baumgartner et al., 2020), are not audio transcriptions (Godfrey et al., 1992), have more than two exchanges between the participants (Rashkin et al., 2019), and are not restricted by a persona (i.e. a few sentences describing the user's interests) (Zhang et al., 2018; Xu et al., 2022a).

Knowledge-Grounded Dialogue (KGD) Wizard of Wikipedia provides a test set with an unseen set of documents (Zhou et al., 2018; Komeili et al., 2022), and its knowledge has not changed over time (i.e. it remains comparable with previous/future studies) (Gopalakrishnan et al., 2019; Hedayatnia et al., 2020).

Task-Oriented Dialogue (TOD) A few other TOD datasets include unstructured knowledge access but consist only of a spoken test set (Kim et al., 2021), or provide no dialogue state annotation (Feng et al., 2020). The dataset proposed in the ninth Dialogue System Technology Challenge augmented MultiWOZ 2.1 (Eric et al., 2020) with knowledge access turns but removed the dialogue state annotation. To always include the dialogue state in our analysis, we recovered the dialogue state annotation from the original MultiWOZ 2.1 dialogues, and we only considered the dialogues from this dataset.

Question Answering (QA) We chose NarrativeQA because it has a publicly available test set (to evaluate the retriever) and answers are expressed as free-form text (to evaluate response generation) (Rajpurkar et al., 2016, 2018; Yang et al., 2018; Kwiatkowski et al., 2019). Although the original task always provides the correct document, we also wanted to investigate the performance of the retriever when considering documents with an average length of 600 tokens. Additionally, we avoided splitting documents into smaller chunks (e.g. passages or sentences) because this would have made the computation of the retriever performance more challenging.

A.2 Implementation and resources

Models and parameters We fine-tuned the models using LoRA (rank 32 and alpha 64) for a maximum of 10 epochs with an early stopping patience of 2. We chose AdamW (Loshchilov and Hutter, 2017) as the optimizer and used a learning rate of 10^-4 for Llama2C and 10^-5 for MistralI (selected based on the performance on the development sets).
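To make this setup concrete, the following minimal sketch shows how a comparable LoRA configuration can be assembled with the Hugging Face peft library. Only the rank, alpha, optimizer, and learning rates come from the description above; the checkpoint name, dtype, and the omitted training loop (maximum 10 epochs, early stopping patience 2) are illustrative assumptions rather than our exact implementation.

    # Sketch of a LoRA fine-tuning setup (hypothetical; only rank, alpha,
    # optimizer, and learning rate are taken from the paragraph above).
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # LoRA adapters with rank 32 and alpha 64, as reported above.
    model = get_peft_model(model, LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM"))
    model.print_trainable_parameters()

    # AdamW with lr 1e-4 for Llama2C (1e-5 for MistralI); the training loop
    # with early stopping (patience 2) is omitted for brevity.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)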
To obtain an encoding for both documents and queries, we used all-mpnet-base-v2 (https://www.sbert.net/docs/pretrained_models.html). We then stored the encoded documents in a FAISS vector store (used for retrieval).
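A minimal sketch of this indexing and retrieval step is given below, assuming the sentence-transformers and faiss packages. The example documents are invented, and the use of inner product over normalized vectors (i.e. cosine similarity) is an assumption, since the similarity metric is not specified above; k=3 mirrors the top-3 retrieval setting used in our experiments.

    # Sketch: encode documents with all-mpnet-base-v2 and index them with FAISS.
    import faiss
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-mpnet-base-v2")
    documents = [  # placeholder knowledge documents
        "First knowledge document.",
        "Second knowledge document.",
        "Third knowledge document.",
    ]

    doc_vectors = encoder.encode(documents, normalize_embeddings=True)
    index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product on unit vectors
    index.add(doc_vectors)

    # Retrieve the top-3 documents for a query.
    query_vector = encoder.encode(["a user query"], normalize_embeddings=True)
    scores, doc_ids = index.search(query_vector, 3)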
Input structure We separated the segments of the input vector with their name followed by a colon (i.e. "Dialogue state:", "Topic:", "Knowledge:", "Question:", "Answer:"), similarly to previous work (Izacard and Grave, 2021; Wang et al., 2022; Chen et al., 2023; Sun et al., 2023). For TOD, we represented the dialogue state as a comma-separated list of domain slot value triplets (Hosseini-Asl et al., 2020b; Wang et al., 2022).
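As an illustration, the sketch below assembles such an input string; the segment ordering, spacing, and example values are our own assumptions rather than the exact format used in the experiments.

    # Sketch: prefix each input segment with its name followed by a colon.
    def build_input(history, dialogue_state=None, topic=None, knowledge=None, question=None):
        segments = []
        if dialogue_state is not None:  # TOD: comma-separated domain slot value triplets
            triplets = ", ".join(f"{d} {s} {v}" for d, s, v in dialogue_state)
            segments.append(f"Dialogue state: {triplets}")
        if topic is not None:
            segments.append(f"Topic: {topic}")
        if knowledge is not None:
            segments.append(f"Knowledge: {knowledge}")
        segments.append(history)
        if question is not None:
            segments.append(f"Question: {question}")
        segments.append("Answer:")  # the model continues from here
        return "\n".join(segments)

    # Hypothetical TOD example.
    prompt = build_input(
        "User: I am looking for a cheap hotel in the north.",
        dialogue_state=[("hotel", "pricerange", "cheap"), ("hotel", "area", "north")],
    )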
Instructions Table 5 reports the instructions used for the in-context learning experiments. For each dialogue type, we experimented with three different instructions describing the task and the various input segments (e.g. dialogue history, topic, and knowledge). We selected the best instruction based on the development set performance.
Generation We sampled 10% of the data (in a stratified fashion, based on the length of the responses) from the development set of each dialogue type. For each model, we used grid search to find, for the sampled data, the combination of parameters (top-p, top-k, and temperature) leading to the highest BLEU-4. The best combination of parameters was used to generate the responses for the test set.
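The sketch below illustrates this selection loop. The candidate grids and the generate_responses helper are hypothetical placeholders, and sacrebleu stands in for whichever BLEU-4 implementation is used.

    # Sketch: pick the decoding parameters maximizing BLEU-4 on the sampled
    # development data. Grids and generate_responses are placeholders.
    import itertools
    import sacrebleu

    def select_decoding_params(generate_responses, dev_inputs, dev_references):
        grid = itertools.product(
            [0.8, 0.9, 0.95],  # top-p candidates (assumed values)
            [20, 50, 100],     # top-k candidates (assumed values)
            [0.7, 1.0, 1.3],   # temperature candidates (assumed values)
        )
        best_params, best_bleu = None, float("-inf")
        for top_p, top_k, temperature in grid:
            hypotheses = generate_responses(dev_inputs, top_p, top_k, temperature)
            bleu = sacrebleu.corpus_bleu(hypotheses, [dev_references]).score
            if bleu > best_bleu:
                best_params, best_bleu = (top_p, top_k, temperature), bleu
        return best_params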
GPU Requirements Most computations were performed on a single NVIDIA A100 GPU with 80GB of memory, requiring less than 50 hours to execute. In a few cases, we had to use two (i.e. fine-tuning the models for QA using more than one document) or three (i.e. computing integrated gradients) A100 GPUs with 80GB each.
Dialogue Type   Instructions

ODD             ""
                "This is a conversation between two people. Use the context to write an engaging reply for the other person."
                "Write a coherent continuation for the proposed conversation."

KGD             ""
                "This is a conversation between two people about a Topic. Use the Dialogue and the additional Knowledge as context to write an engaging reply for the other person."
                "Write a coherent continuation for the proposed conversation based on the additional Knowledge."

TOD             ""
                "In the following conversation a user wants to achieve some goal and needs help from an assistant. Continue the conversation with the response of the assistant."
                "Write a coherent continuation for the proposed conversation."

QA              ""
                "You are presented with a user's Question about a movie or book. Answer to the user's Question using the information provided in the Context."
                "Answer to the user's question using the provided information (if available)."

Table 5: Instructions used to adapt the model to a specific dialogue type with in-context learning. We defined three instructions for each dialogue type (the first being the empty string ""), describing the task and the various input segments (e.g. dialogue history, topic, dialogue state, and knowledge). We selected the best instruction based on the development set performance.

A.3 Additional Automatic Evaluation

To automatically evaluate the quality of the generated text, we considered BLEU-4 (Papineni et al., 2002), F1 (i.e. unigram overlap), and ROUGE-L (Lin, 2004). Furthermore, we used KF1 (Shuster et al., 2021) to measure the overlap between the prediction and the knowledge selected by the annotators. For reproducibility purposes, we computed ROUGE-L using the official implementation (https://github.com/google-research/google-research/tree/master/rouge) and all the remaining metrics using ParlAI (https://parl.ai). No pre-processing was performed on the model-generated answers.
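For clarity, a sketch of the unigram F1 we refer to is given below; KF1 applies the same formula with the selected knowledge in place of the reference. The whitespace tokenization and lowercasing are simplifications, whereas ParlAI applies its own normalization.

    # Sketch: unigram F1 between a prediction and a reference; for KF1 the
    # reference is replaced by the knowledge selected by the annotators.
    from collections import Counter

    def unigram_f1(prediction: str, reference: str) -> float:
        pred_tokens = prediction.lower().split()  # simplified tokenization
        ref_tokens = reference.lower().split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)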
Table 6 reports the performance for each dialogue type. As mentioned in Section 4.1, the best performance is obtained by fine-tuned models. In the following, we analyze the results for each dialogue type.

Open-Domain Dialogue (ODD) Although fine-tuning achieves a higher BLEU-4, the results show that both techniques produce responses that are very different from the ground truth.

Knowledge-Grounded Dialogue (KGD) We report the performance of the models on the unseen test set (i.e. the knowledge base contains documents that are only present in the test set). The results show that models adapted using fine-tuning obtain a higher F1 than in-context learning. Furthermore, the best models tend to copy more from the gold knowledge compared to the annotators (as shown by the ground truth).

Task-Oriented Dialogue (TOD) Differently from the other types, Llama2C and MistralI obtained the best performance in terms of BLEU-4 when fine-tuned with no additional knowledge. Further investigation suggests this happens because of the high overlap (82%) between the knowledge used for training and testing. We report the performance on the documents only available in the test phase in Table 7 (TOD†). In this scenario, gold knowledge does indeed increase the performance of the models.

Question Answering (QA) Although fine-tuned models achieve the highest ROUGE-L, in-context learning models tend to provide longer and possibly more detailed responses, as reflected by KF1. Because ground truths are particularly short (4.26 tokens on average), models that generated longer responses (especially models adapted with in-context learning) were awarded a lower ROUGE-L.

A.3.1 Retriever Accuracy

We study the performance of the retriever for each dialogue type and report Recall@K in Figure 5. Because of the size of its knowledge base (Table 1), the retriever achieves the lowest performance on TOD. However, although the knowledge base for QA is bigger than the one for KGD, the retriever achieves a higher recall for QA.
                                External          BLEU-4          KF1                     F1      ROUGE-L
Model     Technique             Knowledge         ODD     TOD     KGD     TOD     QA      KGD     QA

Llama2C   In-Context Learning   No Know.          0.2     0.85    11.61   13.66   5.26    12.68   5.59
                                Retrieved Know.   –       0.83    13.51   12.10   5.65    12.91   14.86
                                Gold Know.        –       1.07    25.87   21.03   6.72    16.59   23.22
          Fine-Tuning           No Know.          0.3     6.72    17.43   34.04   0.74    18.46   17.25
                                Retrieved Know.   –       4.33    25.10   26.85   1.15    20.70   46.21
                                Gold Know.        –       5.39    76.23   42.69   1.44    38.41   73.38
MistralI  In-Context Learning   No Know.          0.2     1.33    10.96   13.01   4.84    11.04   6.94
                                Retrieved Know.   –       1.06    13.83   12.53   6.09    12.22   10.26
                                Gold Know.        –       1.33    25.95   28.74   7.07    15.88   21.74
          Fine-Tuning           No Know.          0.9     4.09    15.47   29.27   0.67    18.63   12.73
                                Retrieved Know.   –       3.85    21.63   30.44   1.18    20.49   45.40
                                Gold Know.        –       3.94    68.36   43.04   1.46    38.21   70.54
Ground Truth                                      100     100     37.79   38.48   1.52    100     100

Table 6: Automatic Evaluation. BLEU-4, KF1, F1, and ROUGE-L for In-Context Learning and Fine-Tuning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, in different dialogue types: Open-Domain Dialogues (ODDs), Knowledge-Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA). "–" marks settings that do not apply, as no external knowledge is available for ODD.

                                External          BLEU-4            KF1
Model     Technique             Knowledge         TOD     TOD†      TOD     TOD†

Llama2C   In-Context Learning   No Know.          0.85    0.60      13.66   12.39
                                Retrieved Know.   0.83    0.44      12.10   10.44
                                Gold Know.        1.07    2.67      25.87   23.77
          Fine-Tuning           No Know.          6.72    4.33      34.04   25.73
                                Retrieved Know.   4.33    3.15      26.85   22.92
                                Gold Know.        5.39    8.50      42.69   45.49
MistralI  In-Context Learning   No Know.          1.33    1.12      13.01   11.91
                                Retrieved Know.   1.06    1.02      12.53   10.36
                                Gold Know.        1.33    3.70      28.74   28.79
          Fine-Tuning           No Know.          4.09    5.83      29.27   25.47
                                Retrieved Know.   3.85    4.76      30.44   25.61
                                Gold Know.        3.94    10.63     43.04   49.40
Ground Truth                                      100     100       38.48   39.91

Table 7: Automatic Evaluation. BLEU-4 and KF1 for In-Context Learning and Fine-Tuning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, in Task-Oriented Dialogues (TODs). † indicates that only test turns with unseen knowledge were included.

Further study suggests that, although the retriever selects the exact gold sentence in only a few cases, it retrieves a sentence from the same paragraph more than 69% of the time.
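For reference, Recall@K can be computed as sketched below (our own formulation; the per-query retrieved and gold document identifiers are placeholders).

    # Sketch: Recall@K as the percentage of queries whose gold document
    # appears among the top-K retrieved candidates.
    def recall_at_k(retrieved_ids, gold_ids, k):
        hits = sum(gold in retrieved[:k] for retrieved, gold in zip(retrieved_ids, gold_ids))
        return 100.0 * hits / len(gold_ids)

    # Example: recall_at_k([[3, 7, 1], [2, 5, 9]], [7, 4], k=3) returns 50.0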

A.4 Human Evaluation

Table 8 reports the results for the "Correctness" dimension of the human evaluation. Except for ODD, fine-tuning tends to improve correctness. Table 9 presents the question and the answer options for the proposed "Validity" dimension used in QA.

                                External          Correctness (%)
Model     Technique             Knowledge         ODD     KGD     TOD     QA

Llama2C   In-Context Learning   No Know.          95      80      95      75
                                Retrieved Know.   –       80      60      60
                                Gold Know.        –       80      70      80
          Fine-Tuning           No Know.          65      90      70      75
                                Retrieved Know.   –       90      90      55
                                Gold Know.        –       85      85      85
MistralI  In-Context Learning   No Know.          95      70      75      60
                                Retrieved Know.   –       55      70      50
                                Gold Know.        –       85      60      80
          Fine-Tuning           No Know.          65      85      80      50
                                Retrieved Know.   –       75      100     45
                                Gold Know.        –       70      80      85
Ground-Truth                                      95      70      85      80

Table 8: Human Evaluation. Percentage of Correct responses for In-Context Learning and Fine-Tuning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, for different dialogue types: Open-Domain Dialogues (ODDs), Knowledge Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA). "–" marks settings that do not apply, as no external knowledge is available for ODD.

Dimension: Validity
Question: Is the response candidate valid?

Answer Option    Option Definition
Valid            The response candidate includes the right information from the context to adequately answer the proposed question.
Not Valid        The response candidate does not include the right information from the context to adequately answer the proposed question.
I don't know     The response candidate includes some information that is adequate to answer the proposed question, but some that is not.

Table 9: Question and answer options presented to the annotators for the proposed Validity dimension.

[Figure 5: line plot of Recall@K (%) against the number of retrieved documents K in {1, 3, 5} for QA, TOD, KGD (Sentence), and KGD (Paragraph).]

Figure 5: Performance of the off-the-shelf retriever for each dialogue type. The retriever achieves the lowest Recall@K on TOD because of its larger knowledge base (2900 documents). However, the retriever achieves a higher Recall@K for QA, even though its knowledge base is bigger than the one for KGD (355 vs. 61 ± 21 documents). Further analysis indicates that, although the model is not able to retrieve the exact sentence selected by the annotator (KGD Sentence), it selects a sentence belonging to the same paragraph more than 69% of the time (KGD Paragraph).

