Table 2: Automatic Evaluation. Perplexity of Fine-Tuning and In-Context Learning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, in different dialogue types: Open-Domain Dialogues (ODDs), Knowledge-Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA). Results for fine-tuned models report the mean and standard deviation over three runs.
We use an off-the-shelf retriever (LangChain: https://github.com/langchain-ai/langchain; model details in §A.2) to retrieve documents from the unstructured knowledge base. First, we encode all the documents considering their content together with their topic (KGD), place or service name (TOD), or title (QA) (Karpukhin et al., 2020). Then, at each turn, we retrieve the k most similar documents based on L2 distance with the encoded context. Finally, we feed the retrieved documents to the base models together with the context to generate a response.

In the gold knowledge scenario, we directly feed the model with the ground-truth documents. This serves as an upper bound for RAG. Additionally, this strategy allows us to study the ability of the techniques to incorporate knowledge in the responses.
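For illustration, a minimal sketch of this retrieval pipeline. The paper uses an off-the-shelf LangChain retriever; the sketch below reimplements the same idea directly with sentence-transformers and FAISS. The encoder checkpoint, example documents, and k = 3 are assumptions, not the paper's exact configuration (see §A.2).

```python
# Sketch: encode documents (content + metadata), index them with an exact
# L2 FAISS index, and fetch the top-k documents for the dialogue context.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

# Each document is encoded together with its metadata: topic (KGD),
# place/service name (TOD), or title (QA).
documents = [
    {"meta": "Taj Mahal", "content": "The Taj Mahal is a mausoleum in Agra ..."},
    {"meta": "Pizza Hut", "content": "Pizza Hut is a restaurant serving ..."},
]
texts = [f"{d['meta']}: {d['content']}" for d in documents]
doc_vecs = encoder.encode(texts)

index = faiss.IndexFlatL2(doc_vecs.shape[1])  # exact L2 distance search
index.add(doc_vecs)

def retrieve(context: str, k: int = 3) -> list[str]:
    """Return the k documents closest (in L2) to the encoded context."""
    query_vec = encoder.encode([context])
    _, ids = index.search(query_vec, k)
    return [texts[i] for i in ids[0]]

# The retrieved documents are then fed to the base model with the context.
retrieved = retrieve("User: Tell me about the Taj Mahal.")
```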
3.4 Models

We select the widely used 7B versions of Llama2C and MistralI as base models. For in-context learning, we experiment with three instructions for each dialogue type and select the best based on development set performance. For fine-tuning, we use LoRA, a parameter-efficient technique that has shown comparable performance to fine-tuning all parameters (Hu et al., 2021). Further details about the parameters are reported in §A.2.
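For reference, a minimal LoRA setup with Hugging Face's peft library; the rank, scaling factor, and target modules below are illustrative placeholders, not the paper's reported hyperparameters (§A.2).

```python
# Sketch: parameter-efficient fine-tuning with LoRA (Hu et al., 2021) via peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (placeholder)
    lora_alpha=16,                         # scaling factor (placeholder)
    target_modules=["q_proj", "v_proj"],   # attention projections (placeholder)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```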
4 Evaluation

We conduct a comparative study on the impact of in-context learning and fine-tuning to adapt LLMs for dialogues. We select Llama2C and MistralI as base LLMs and experiment in four dialogue types: ODDs, KGDs, TODs, and QA. For each dialogue type, we study the impact of external knowledge, both retrieved and gold. Further details about the implementation and the resources used are available in the appendix (§A.2).

4.1 Automatic Evaluation

Currently available automatic metrics for the task of response generation are not interpretable and correlate poorly with human judgments (Liu et al., 2016; Sai et al., 2022; Mousavi et al., 2022). Therefore, we focus on perplexity, as it is derived from the objective function used to fine-tune the models, and present other metrics in §A.3.

Table 2 reports the perplexity of Llama2C and MistralI on the test set of each dialogue type. In all dialogue types, fine-tuned models obtain better performance than in-context learning. When considering the impact of external knowledge, models fine-tuned on TODs show that knowledge slightly increases perplexity. The high perplexity obtained by in-context learning models on QA can be explained by two reasons: first, besides the knowledge, only the question is used as context; second, while the ground truths are particularly short (4.26 tokens on average), these models generate long responses, making them unlikely to include the correct answer in the first few tokens. This does not happen for fine-tuned models, since they are trained to generate shorter responses. Nevertheless, the best results have been obtained with gold knowledge. We report automatic evaluation results, including retriever accuracy, the overlap between knowledge and response tokens, and other automatic metrics, in §A.3.
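Since perplexity is the exponential of the average cross-entropy that fine-tuning minimizes, it can be computed directly from the model loss. A minimal sketch for a Hugging Face causal LM follows, assuming perplexity is measured over the response tokens only; this scoring choice is an assumption, not the paper's documented setup.

```python
# Sketch: perplexity as exp(mean token-level cross-entropy) of the response
# given the dialogue context, under a causal LM.
import torch

@torch.no_grad()
def perplexity(model, tokenizer, prompt: str, response: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the response tokens
    loss = model(full_ids, labels=labels).loss  # mean cross-entropy (nats)
    return torch.exp(loss).item()
```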
| Model | Dialogue Type | Technique | Instruction | Topic/Dialogue State | Dialogue History | Knowledge |
|---|---|---|---|---|---|---|
| Llama2C | KGD | In-Context Learning | 21.85 | 28.60 | 15.97 | 33.58 |
| Llama2C | KGD | Fine-Tuning | - | 39.43 | 13.80 | 46.77 |
| Llama2C | TOD | In-Context Learning | 25.98 | 19.54 | 16.46 | 38.02 |
| Llama2C | TOD | Fine-Tuning | - | 27.19 | 8.04 | 64.77 |
| MistralI | KGD | In-Context Learning | | 69.01 | 14.89 | 16.10 |
| MistralI | KGD | Fine-Tuning | - | 65.55 | 11.00 | 23.45 |
| MistralI | TOD | In-Context Learning | 69.05 | 10.19 | 11.24 | 9.52 |
| MistralI | TOD | Fine-Tuning | - | 14.55 | 29.06 | 56.39 |

Table 3: Explainability Study. Percentage of tokens with a significant contribution to the generation in different segments of the input vector, for each model in Knowledge-Grounded Dialogues (KGDs) and Task-Oriented Dialogues (TODs). All rows sum to 100. For KGD, the Topic/Dialogue State column reports the contribution of the Topic, while for TOD it reports the contribution of the Dialogue State. The Instruction segment is only present for In-Context Learning.
4.1.1 Explainability Study

To understand the contribution of each segment of the input vector (i.e., instruction, context, knowledge, topic, and dialogue state), we compute integrated gradients (Sarti et al., 2023) of the input elements using Inseq and select the most contributing input tokens (top 25%). Table 3 reports the percentage of the most contributing tokens that fall in each segment (normalized by the length of the segment). In general, in both KGD and TOD, the dialogue history is the least contributing segment, which might indicate that only a part of the history is significant for response generation. On the other hand, in KGD the topic has a higher score than the dialogue history, suggesting its importance for response generation in this dialogue type. Interestingly, MistralI gives considerably more importance to the topic than Llama2C, decreasing the importance of the knowledge segment. For the TOD type, the most contributing segment is often the knowledge, reaching over 50% with fine-tuning. This suggests that knowledge is more relevant for TOD and that relevance changes with respect to the dialogue type.
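A minimal sketch of this attribution step with Inseq (Sarti et al., 2023) follows; the model checkpoint and the input text are illustrative placeholders, not the paper's setup.

```python
# Sketch: integrated-gradients attribution over the input tokens of a
# generated response, using the Inseq toolkit.
import inseq

model = inseq.load_model("gpt2", "integrated_gradients")  # placeholder model
out = model.attribute(
    input_texts=(
        "Topic: jazz\n"
        "Knowledge: Jazz originated in New Orleans.\n"
        "A: Where does jazz come from?\nB:"
    ),
)
out.show()  # per-token attribution scores for the generated continuation
```

Selecting the top 25% of tokens by attribution score and normalizing the per-segment counts by segment length then yields percentages of the kind reported in Table 3.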
4.2 Human Evaluation

Considering the uninterpretability of automatic evaluations, we conducted a human evaluation of the generated responses to gain more insight into the models' performance. Mousavi et al. (2022) proposed four dimensions to evaluate response generation based on the most common errors and qualities. We evaluate the responses using their protocol and three of their dimensions:

• Contextualization: the response includes explicit or implicit references to the dialogue history (ODD, KGD, TOD) or the gold knowledge (QA);
• Appropriateness: the response is coherent and makes sense as a continuation of the dialogue;
• Correctness: the response is grammatically and syntactically correct.

According to these dimensions, we evaluate the responses for all techniques, models, and knowledge scenarios, in all dialogue types. The only exception is QA, where we do not evaluate "Appropriateness", since this dimension considers coherence with respect to a dialogue history, whereas QA only consists of question-answer exchanges. Instead, we extend the protocol (the extended protocol is available at https://github.com/sislab-unitn/Human-Evaluation-Protocol/tree/v1.1) by proposing a new dimension for QA:

• Validity: the response includes adequate information to answer the question.

For TOD, we do not include a dimension to evaluate whether the response is in line with user requirements, as this can be measured automatically (via dialogue state tracking metrics, e.g., Joint Goal Accuracy).
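As a concrete reference for the metric just mentioned, a minimal sketch of Joint Goal Accuracy: the fraction of turns whose full predicted dialogue state matches the gold state. The slot names and values below are toy data, not from the paper.

```python
# Sketch: Joint Goal Accuracy = share of turns where the predicted dialogue
# state (all slot-value pairs) exactly matches the gold state.
def joint_goal_accuracy(predicted: list[dict], gold: list[dict]) -> float:
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

# Example: one of two turns has a fully correct state -> JGA = 0.5
pred = [{"hotel-area": "north"}, {"hotel-area": "north", "hotel-stars": "4"}]
gold = [{"hotel-area": "north"}, {"hotel-area": "east", "hotel-stars": "4"}]
assert joint_goal_accuracy(pred, gold) == 0.5
```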
| Model | Technique | External Knowledge | Ctx. ODD | Ctx. KGD | Ctx. TOD | Ctx. QA | App. ODD | App. KGD | App. TOD | Val. QA |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama2C | In-Context Learning | No Know. | 85 | 70 | 70 | 50 | 80 | 70 | 60 | 10 |
| Llama2C | In-Context Learning | Retrieved Know. | - | 75 | 65 | 70 | - | 75 | 45 | 35 |
| Llama2C | In-Context Learning | Gold Know. | - | 90 | 40 | 90 | - | 85 | 45 | 80 |
| Llama2C | Fine-Tuning | No Know. | 45 | 60 | 70 | 15 | 50 | 65 | 60 | 15 |
| Llama2C | Fine-Tuning | Retrieved Know. | - | 65 | 90 | 45 | - | 80 | 80 | 45 |
| Llama2C | Fine-Tuning | Gold Know. | - | 80 | 85 | 85 | - | 65 | 85 | 75 |
| MistralI | In-Context Learning | No Know. | 90 | 80 | 70 | 20 | 85 | 85 | 65 | 20 |
| MistralI | In-Context Learning | Retrieved Know. | - | 75 | 65 | 40 | - | 65 | 60 | 25 |
| MistralI | In-Context Learning | Gold Know. | - | 90 | 55 | 75 | - | 70 | 55 | 80 |
| MistralI | Fine-Tuning | No Know. | 55 | 90 | 85 | 25 | 55 | 80 | 80 | 20 |
| MistralI | Fine-Tuning | Retrieved Know. | - | 95 | 85 | 30 | - | 85 | 90 | 40 |
| MistralI | Fine-Tuning | Gold Know. | - | 80 | 75 | 70 | - | 65 | 70 | 70 |
| Ground-Truth | | | 95 | 80 | 95 | 90 | 100 | 85 | 95 | 90 |

Table 4: Human Evaluation. Percentage of Contextualized (Ctx.; ODD, KGD, TOD, QA), Appropriate (App.; ODD, KGD, TOD), and Valid (Val.; QA) responses for In-Context Learning and Fine-Tuning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, in different dialogue types: Open-Domain Dialogues (ODDs), Knowledge-Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA).
The dimensions can have either a positive or a negative answer value, as well as "I don't know", to avoid forcing erroneous judgments in either direction. For "Contextualization" and "Appropriateness", we also ask the annotators to motivate their negative judgments with the explanations proposed in the original protocol. We present the explanations and the related results in §4.3.

We recruited 75 annotators on the Prolific platform (https://www.prolific.com/) and assigned 5 dialogues to each annotator. After performing quality control, we approved 65 annotators (marked as good on the Prolific platform) with a compensation of 9.00£/hour. Due to the large number of responses, each annotator evaluated a different set of model responses for a given dialogue. For the purpose of quality control, for each dialogue type, two dialogues overlapped among five annotators, while the remaining dialogues were annotated by one crowd-worker, with an overlap only on the ground truth. The inter-annotator agreement, measured with Fleiss' κ (Fleiss, 1971), was 0.65 (substantial agreement).
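For reference, Fleiss' κ on the overlapping dialogues can be computed with statsmodels; the judgment matrix below is a toy illustration, not the actual annotations.

```python
# Sketch: inter-annotator agreement with Fleiss' kappa (Fleiss, 1971).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = evaluated responses, columns = the five overlapping annotators;
# 1 = positive judgment, 0 = negative, 2 = "I don't know" (toy values)
judgments = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [2, 1, 1, 1, 0],
])
counts, _ = aggregate_raters(judgments)  # items x categories count table
print(fleiss_kappa(counts))  # kappa in [0.61, 0.80] is "substantial"
```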
As results of the human evaluation (Table 4), we report the percentage of positively judged responses (Contextualized, Appropriate, Valid) for Llama2C and MistralI when considering different adaptation techniques (Fine-Tuning and In-Context Learning) and knowledge (No Knowledge, Retrieved Knowledge, and Gold Knowledge) across the different dialogue types. As for ODDs, we report no results for the Retrieved and Gold Knowledge scenarios, since no knowledge was used for this dialogue type. Additional results on "Correctness" are reported in §A.4.

Open-Domain Dialogue (ODD) Models fine-tuned for ODD tend to generate considerably less contextualized responses than models adapted using in-context learning. In particular, fine-tuning reduces contextualization by 40% for Llama2C and by 35% for MistralI. Similarly, fine-tuning reduces their appropriateness by 30% compared to their in-context learning versions. This contrasts with the automatic evaluation (Table 2), where in-context learning obtained a higher perplexity (i.e., worse results) than fine-tuning.

Knowledge-Grounded Dialogue (KGD) Concerning KGD, the results are model-dependent. When considering Llama2C, in-context learning provides, regardless of the knowledge, 10% more contextualized responses than fine-tuning. On the other hand, fine-tuning MistralI on Retrieved Knowledge leads to the highest contextualization (95%). However, using Gold instead of Retrieved Knowledge reduces the contextualization of the fine-tuned model by 15%. Furthermore, when considering the best models, Llama2C and MistralI have a higher contextualization than the ground truth (by 10 to 15%), suggesting that the models copy more from the dialogue history. Similarly to contextualization, adapting Llama2C with in-context learning and Gold Knowledge provides
Figure 1: Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, and Incoherent) (x-axis), for Llama2C and MistralI, adapted with In-Context Learning and Fine-Tuning in Open-Domain Dialogues (ODDs).

Figure 2: Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, and Incoherent) (x-axis), for Llama2C and MistralI, adapted with In-Context Learning and Fine-Tuning in Knowledge-Grounded Dialogues (KGDs).

Figure 3: Percentage of LLM responses (y-axis) for each error type (Not Contextualized and Not Appropriate) and their explanation (Generic, Hallucinated, Incoherent, and Unhelpful) (x-axis), for Llama2C and MistralI, adapted with In-Context Learning and Fine-Tuning in Task-Oriented Dialogues (TODs).

Figure 4: Percentage of LLM responses (y-axis) for each error type (Not Contextualized) and their explanation (Generic and Hallucinated) (x-axis), for Llama2C and MistralI, adapted with In-Context Learning and Fine-Tuning in Question Answering (QA).
Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.

Yoav Levine, Ori Ram, Daniel Jannai, Barak Lenz, Shai Shalev-Shwartz, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2022. Huge frozen language models as readers for open-domain question answering. In ICML 2022 Workshop on Knowledge Retrieval and Language Models.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. In Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023), Toronto, Canada. Association for Computational Linguistics.

Seyed Mahed Mousavi, Simone Caldarella, and Giuseppe Riccardi. 2023. Response generation in longitudinal dialogues: Which knowledge representation helps? In Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023), pages 1–11, Toronto, Canada. Association for Computational Linguistics.

Seyed Mahed Mousavi, Gabriel Roccabruna, Simone Alghisi, Massimo Rizzoli, Mirco Ravanelli, and Giuseppe Riccardi. 2024. Are llms robust for spoken dialogues?

Seyed Mahed Mousavi, Gabriel Roccabruna, Michela Lorandi, Simone Caldarella, and Giuseppe Riccardi. 2022. Evaluation of response generation models: Shouldn't it be shareable and replicable? In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 136–147, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Yushan Qian, Weinan Zhang, and Ting Liu. 2023. Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6516–6528, Singapore. Association for Computational Linguistics.
Lang Qin, Yao Zhang, Hongru Liang, Jun Wang, and Zhenglu Yang. 2023. Well begun is half done: Generator-agnostic knowledge pre-selection for knowledge-grounded dialogue. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4696–4709, Singapore. Association for Computational Linguistics.

Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, pages 539–548, New York, NY, USA. Association for Computing Machinery.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Gonçalo Raposo, Luisa Coheur, and Bruno Martins. 2023. Prompting, retrieval, training: An exploration of different approaches for task-oriented dialogue generation. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 400–412, Prague, Czechia. Association for Computational Linguistics.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.

Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra. 2022. A survey of evaluation metrics used for nlg systems. ACM Comput. Surv., 55(2).

Gabriele Sarti, Nils Feldhus, Ludwig Sickert, and Oskar van der Wal. 2023. Inseq: An interpretability toolkit for sequence generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 421–435, Toronto, Canada. Association for Computational Linguistics.

Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Weiwei Sun, Pengjie Ren, and Zhaochun Ren. 2023. Generative knowledge selection for knowledge-grounded dialogues. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2077–2088, Dubrovnik, Croatia. Association for Computational Linguistics.

David Thulke, Nico Daheim, Christian Dugast, and Hermann Ney. 2024. Task-oriented document-grounded dialog systems by hltpr@rwth for dstc9 and dstc10. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:733–741.

Jörg Tiedemann. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces, volume 5, pages 237–248.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Weizhi Wang, Zhirui Zhang, Junliang Guo, Yinpei Dai, Boxing Chen, and Weihua Luo. 2022. Task-oriented dialogue system as natural language generation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, pages 2698–2703, New York, NY, USA. Association for Computing Machinery.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.

Jing Xu, Arthur Szlam, and Jason Weston. 2022a. Beyond goldfish memory: Long-term open-domain conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5180–5197, Dublin, Ireland. Association for Computational Linguistics.

Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. 2022b. Long time no see! open-domain conversation with long-term persona memory. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2639–2650, Dublin, Ireland. Association for Computational Linguistics.

on Empirical Methods in Natural Language Processing, pages 708–713, Brussels, Belgium. Association for Computational Linguistics.
ODD "This is a conversation between two people. Use the context to write an engaging
reply for the other person."
"Write a coherent continuation for the proposed conversation."
""
"This is a conversation between two people about a Topic. Use the Dialogue and the
KGD additional Knowledge as context to write an engaging reply for the other person.",
"Write a coherent continuation for the proposed conversation based on the additional
Knowledge."
""
TOD "In the following conversation a user wants to achieve some goal and needs help from
an assistant. Continue the conversation with the response of the assistant."
"Write a coherent continuation for the proposed conversation."
""
QA "You are presented with a user’s Question about a movie or book. Answer to the user’s
Question using the information provided in the Context."
"Answer to the user’s question using the provided information (if available)."
Table 5: Instructions used to adapt the model to a specific dialogue type with in-context learning. We defined three
instructions for each dialogue type, describing the task and the various input segments (e.g. dialogue history, topic,
dialogue state, and knowledge). We selected the best instruction based on the development set performance.
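To make the table concrete, a sketch of how one of these instructions could be composed with the input segments. The segment labels and layout are illustrative assumptions, not the paper's exact template (see §A.2).

```python
# Sketch: composing the in-context learning input from an instruction
# (Table 5) and the dialogue-type-specific segments.
def build_prompt(instruction: str, history: list[str],
                 topic: str | None = None, knowledge: str | None = None) -> str:
    parts = [instruction] if instruction else []  # "" = empty instruction
    if topic:
        parts.append(f"Topic: {topic}")
    if knowledge:
        parts.append(f"Knowledge: {knowledge}")
    parts.append("Dialogue:\n" + "\n".join(history))
    return "\n\n".join(parts)

prompt = build_prompt(
    "Write a coherent continuation for the proposed conversation based on "
    "the additional Knowledge.",
    history=["A: Have you heard any jazz lately?"],
    topic="Jazz",
    knowledge="Jazz originated in New Orleans in the late 19th century.",
)
```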
by the annotators. For reproducibility purposes, we have computed ROUGE-L using the official implementation (https://github.com/google-research/google-research/tree/master/rouge) and all the remaining metrics using ParlAI (https://parl.ai). No pre-processing was performed on the model-generated answers.
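For illustration, a minimal call to the official rouge-score package linked above; enabling the stemmer is an assumption, not a documented choice of the paper.

```python
# Sketch: ROUGE-L between a ground-truth answer and a generated one, using
# the official Google implementation (pip package: rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score("Casablanca", "The movie is Casablanca.")  # (target, prediction)
print(score["rougeL"].fmeasure)
```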
Table 6 reports the performance for each dialogue type. As mentioned in Section 4.1, the best performance is obtained by fine-tuned models. In the following, we analyze the results for each dialogue type.

Open-Domain Dialogue (ODD) Although fine-tuning achieves a higher BLEU-4, the results show that both techniques produce responses that are very different from the ground truth.

Knowledge-Grounded Dialogue (KGD) We report the performance of the models on the unseen test set (i.e., the knowledge base contains documents that are only present in the test set). The results show that models adapted using fine-tuning obtain a higher F1 than in-context learning. Furthermore, the best models tend to copy more from the gold knowledge than the annotators do (as shown in the ground truth).

Task-Oriented Dialogue (TOD) Differently from the other types, Llama2C and MistralI have obtained the best performance in terms of BLEU-4 when fine-tuned with no additional knowledge. Further investigation suggests this happens because of the high overlap between the knowledge used for training and testing (82%). We report the performance on the documents only available in the test phase in Table 7 (TOD†). In this scenario, gold knowledge does indeed increase the performance of the models.

Question Answering (QA) Although fine-tuned models achieve the highest ROUGE-L, in-context learning models tend to provide longer and possibly more detailed responses, as reflected by KF1. Because the ground truths are particularly short (4.26 tokens on average), models that generated longer responses (especially models adapted with in-context learning) were awarded a lower ROUGE-L.

A.3.1 Retriever Accuracy

We study the performance of the retriever for each dialogue type and report Recall@K in Figure 5. Because of the size of the knowledge base (Table 1), the retriever achieves the lowest performance on TOD. However, although the knowledge base for QA is bigger than the one for KGD, the retriever achieves a higher recall for QA. Further studies suggest that, although the retriever selects the gold sentence in
| Model | Technique | External Knowledge | BLEU-4 ODD | BLEU-4 TOD | KF1 KGD | KF1 TOD | KF1 QA | F1 KGD | ROUGE-L QA |
|---|---|---|---|---|---|---|---|---|---|
| Llama2C | In-Context Learning | No Know. | 0.2 | 0.85 | 11.61 | 13.66 | 5.26 | 12.68 | 5.59 |
| Llama2C | In-Context Learning | Retrieved Know. | - | 0.83 | 13.51 | 12.10 | 5.65 | 12.91 | 14.86 |
| Llama2C | In-Context Learning | Gold Know. | - | 1.07 | 25.87 | 21.03 | 6.72 | 16.59 | 23.22 |
| Llama2C | Fine-Tuning | No Know. | 0.3 | 6.72 | 17.43 | 34.04 | 0.74 | 18.46 | 17.25 |
| Llama2C | Fine-Tuning | Retrieved Know. | - | 4.33 | 25.10 | 26.85 | 1.15 | 20.70 | 46.21 |
| Llama2C | Fine-Tuning | Gold Know. | - | 5.39 | 76.23 | 42.69 | 1.44 | 38.41 | 73.38 |
| MistralI | In-Context Learning | No Know. | 0.2 | 1.33 | 10.96 | 13.01 | 4.84 | 11.04 | 6.94 |
| MistralI | In-Context Learning | Retrieved Know. | - | 1.06 | 13.83 | 12.53 | 6.09 | 12.22 | 10.26 |
| MistralI | In-Context Learning | Gold Know. | - | 1.33 | 25.95 | 28.74 | 7.07 | 15.88 | 21.74 |
| MistralI | Fine-Tuning | No Know. | 0.9 | 4.09 | 15.47 | 29.27 | 0.67 | 18.63 | 12.73 |
| MistralI | Fine-Tuning | Retrieved Know. | - | 3.85 | 21.63 | 30.44 | 1.18 | 20.49 | 45.40 |
| MistralI | Fine-Tuning | Gold Know. | - | 3.94 | 68.36 | 43.04 | 1.46 | 38.21 | 70.54 |
| Ground Truth | | | 100 | 100 | 37.79 | 38.48 | 1.52 | 100 | 100 |

Table 6: Automatic Evaluation. BLEU-4, KF1, F1, and ROUGE-L for In-Context Learning and Fine-Tuning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, in different dialogue types: Open-Domain Dialogues (ODDs), Knowledge-Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA).
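KF1 and F1 in Table 6 are token-overlap F1 scores, computed against the knowledge and against the ground-truth response, respectively. A minimal sketch of the standard unigram formulation follows; the paper computes these metrics with ParlAI, whose tokenization and normalization may differ.

```python
# Sketch: unigram-overlap F1, the usual formulation behind F1 (response vs.
# ground truth) and KF1 (response vs. knowledge).
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```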
Table 7: Automatic Evaluation. BLEU-4 and KF1 for In-Context Learning and Fine-Tuning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, in Task-Oriented Dialogues (TODs). † indicates that only test turns with unseen knowledge were included.
| Model | Technique | External Knowledge | Correctness ODD | Correctness KGD | Correctness TOD | Correctness QA |
|---|---|---|---|---|---|---|
| Llama2C | In-Context Learning | No Know. | 95 | 80 | 95 | 75 |
| Llama2C | In-Context Learning | Retrieved Know. | - | 80 | 60 | 60 |
| Llama2C | In-Context Learning | Gold Know. | - | 80 | 70 | 80 |
| Llama2C | Fine-Tuning | No Know. | 65 | 90 | 70 | 75 |
| Llama2C | Fine-Tuning | Retrieved Know. | - | 90 | 90 | 55 |
| Llama2C | Fine-Tuning | Gold Know. | - | 85 | 85 | 85 |
| MistralI | In-Context Learning | No Know. | 95 | 70 | 75 | 60 |
| MistralI | In-Context Learning | Retrieved Know. | - | 55 | 70 | 50 |
| MistralI | In-Context Learning | Gold Know. | - | 85 | 60 | 80 |
| MistralI | Fine-Tuning | No Know. | 65 | 85 | 80 | 50 |
| MistralI | Fine-Tuning | Retrieved Know. | - | 75 | 100 | 45 |
| MistralI | Fine-Tuning | Gold Know. | - | 70 | 80 | 85 |
| Ground-Truth | | | 95 | 70 | 85 | 80 |

Table 8: Human Evaluation. Percentage of Correct (ODD, KGD, TOD, QA) responses for In-Context Learning and Fine-Tuning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2C and MistralI, for different dialogue types: Open-Domain Dialogues (ODDs), Knowledge-Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA).
Table 9: Question and answer options presented to the annotators for the proposed Validity dimension.
[Figure 5 plot: "Recall@K for different Dialogue Settings". Recall@K (%) on the y-axis against the number of retrieved documents K (1, 3, 5) on the x-axis, with one curve each for QA, TOD, KGD (Sentence), and KGD (Paragraph).]

Figure 5: Performance of the off-the-shelf retriever for each dialogue type. The retriever achieves the lowest Recall@K on TOD because of the larger knowledge base size (2900 documents). However, the retriever achieves a higher Recall@K for QA, even though its knowledge base is bigger than the one for KGD (355 vs. 61 ± 21). Further studies indicate that, although the retriever is not capable of retrieving the exact sentence selected by the annotator (KGD Sentence), it selects a sentence belonging to the same paragraph more than 69% of the time (KGD Paragraph).
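For completeness, a minimal sketch of the Recall@K metric reported in Figure 5: the percentage of turns for which the gold document appears among the top-K retrieved ones. The helper below is hypothetical, evaluated at K = 1, 3, 5 as in the figure.

```python
# Sketch: Recall@K = share of turns whose gold document is in the top-K
# retrieved documents, expressed as a percentage.
def recall_at_k(retrieved: list[list[str]], gold: list[str], k: int) -> float:
    hits = sum(g in r[:k] for r, g in zip(retrieved, gold))
    return 100 * hits / len(gold)

# e.g. [recall_at_k(runs, gold_docs, k) for k in (1, 3, 5)]
```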