Sources of Hallucination by Large Language Models on Inference Tasks

Abstract

Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization, yet this capability is under-explored. We present a series of behavioral studies on several LLM families (LLaMA, GPT-3.5, and PaLM) which probe their behavior using controlled experiments. We establish two factors which predict much of their performance, and propose that these are major sources of hallucination in generative LLMs. First, the most influential factor is memorization of the training data. We show that models falsely label NLI test samples as entailing when the hypothesis is attested in the training text, regardless of the premise. We further show that named entity IDs are used as "indices" to access the memorized data. Second, we show that LLMs exploit a further corpus-based heuristic using the relative frequencies of words. We show that LLMs score significantly worse on NLI test samples which do not conform to these factors than on those which do; we also discuss a tension between the two factors, and a performance trade-off.[1]

* Co-first authors with equal contribution.
[1] Code and LLM outputs (from LLaMA and GPT-3.5) are available at https://github.com/Teddy-Li/LLM-NLI-Analysis.

1 Introduction

Large Language Models (LLMs) such as LLaMA, GPT-3/4, and PaLM (Touvron et al., 2023; Brown et al., 2020; Chowdhery et al., 2022) have been trusted by many to perform language understanding in downstream tasks such as summarization, question answering, and fact verification, among others (Zhang et al., 2023). However, due to the large-scale nature of LLM training on vast, often proprietary data, and the inherent opacity of LLM parameters, it is difficult to explain their behavior when answering user queries, or to assess the corresponding risks in terms of bias and robustness. In particular, LLMs are prone to hallucination: generating content which is untrue or inappropriate, presented in a factual manner.

This paper investigates how LLMs perform on natural language inference tasks, sometimes called textual entailment, a basic capability forming part of language understanding which is used in real tasks. We look at directional entailments, which hold in one direction but not both: for example, DEFEAT entails PLAY, but PLAY does not entail DEFEAT. Inferring directional entailment is more difficult than inferring paraphrase, which is symmetric, so it more deeply probes understanding.

Our approach is a behavioral study of prompted LLM decision-making. We alter existing directional inference datasets in targeted ways while measuring how predictions change, across several major LLM families (LLaMA, GPT-3.5, and PaLM). We demonstrate two sources of LLM performance on the directional NLI task, which also explain false positive hallucination: (1) an LLM bias toward affirming sample hypotheses that are attested in the training text, including reliance on named entity identifiers; and (2) a corpus-term-frequency heuristic, biased toward test samples with premises less frequent than hypotheses.

We establish that these originate from the LLM pretraining objective, in which statistical modeling of the natural distribution of human-generated text leads to (at the level of sentences) memorizing individual statements, and (at the level of corpora) learning typical patterns of usage. Though superficially impressive in performance, our experiments show that even powerful LLMs still use such unsatisfactory tools instead of human-like reasoning.

We present three contributions: demonstrations of both factors across several experiments, and an impact analysis.
(1) In a prompting scenario, LLMs respond to NLI test samples according to a veracity bias, affirming hypotheses more readily if seen in the pretraining text. When a hypothesis proposition is attested in training according to the model itself, LLaMA-65B, GPT-3.5, and PaLM-540B are respectively 1.9, 2.2, and 2.0 times more likely to wrongly predict a false positive inference, compared to when it is not attested. Further, LLMs recall this memorized information using named entities as identifying "indices," even though these are irrelevant to the logic of the predicate inference task.

(2) When relevant memorized text is not available, LLMs back off to a simple corpus-based heuristic using term frequency to make judgements. LLaMA-65B, GPT-3.5, and PaLM-540B are 1.5, 1.7, and 2.0 times more likely to wrongly predict a false positive when the hypothesis has a higher relative frequency than the premise than when it does not.

(3) For the NLI subsets consistent with these factors, LLMs appear to be excellent classifiers; for NLI subsets adversarial to them, LLM performance degrades severely. We show that when labels go against the veracity prior, LLMs degrade into poor or even near-random classifiers; for the relative frequency heuristic, we also show a substantial performance decrease, consistently across all the LLMs.

2 Related Work

Prior work has shown that LLMs memorize much of their training data (Carlini et al., 2023; Tirumala et al., 2022). We show that such memorization affects LLM judgements on language inference, and that frequency in training data affects generalization in various ways (§6, §7).

Webson and Pavlick (2022) show that LLMs perform surprisingly well in NLP tasks even conditioned on pathologically unhelpful prompts, calling into question whether LLMs understand prompts the same way as humans. In this vein, we show that tasks formatted for language inference may be answered like memory recall tasks instead, which may happen to align with correct labels.

Recently, Bubeck et al. (2023) advocated that GPT-4 has a deep and flexible understanding of language "far beyond memorization". Although we do not disprove the existence of such abilities, we show that GPT-4, like the other LLMs, is also subject to these hallucinations (see Appendix D).

Corpus frequency statistics have long been studied: it is well-known that nouns follow a trend of becoming more specific as corpus frequency decreases (Caraballo and Charniak, 1999). McKenna and Steedman (2022) also argued that more specific predicates tend to be less corpus-frequent than more general predicates. Since entailments may carry from a specific predicate to a more general one, e.g. SPRINT entails RUN, relative corpus frequency can be indicative of entailment, though it has no direct relationship to meaning.
Table 1: From the original dataset task (I) we derive the Random Premise task (I_RP), where the premise predicate has been randomized while respecting type-constraints. A random premise predicate is highly unlikely to entail the hypothesis, so all labels become False. Sample format: (Label) [premise] ⇒ [hypothesis].

Table 2: We estimate the probabilities of predicting an entailment in the original task (I) and random premise task (I_RP), conditioned on the model's own judgement of hypothesis veracity (V). In I_RP, all judgements of Entail are false positives (hallucinations).
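Concretely, these conditional probabilities reduce to counting Entail predictions within each veracity group. The sketch below is a minimal illustration of that computation, not the released implementation; the record format and field names are assumptions made for exposition.

```python
from collections import defaultdict

def conditional_entail_rates(records):
    """Estimate P(prediction = Entail | V) from per-sample records.

    Each record is assumed to hold the model's hypothesis-only veracity
    judgement (`veracity`: True/False) and its entailment prediction on a
    task such as I or I_RP (`prediction`: "Entail" / "No-Entail").
    """
    counts = defaultdict(lambda: [0, 0])  # veracity value -> [n_entail, n_total]
    for r in records:
        counts[r["veracity"]][1] += 1
        if r["prediction"] == "Entail":
            counts[r["veracity"]][0] += 1
    # On I_RP every Entail is a false positive, so the ratio of the two rates
    # (attested vs. not attested) is the separation reported in Table 2.
    return {v: n_entail / n_total for v, (n_entail, n_total) in counts.items() if n_total}
```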
As shown in Table 2, LLaMA-65B, GPT-3.5, and PaLM-540B show a 1.9x, 2.2x, and 2.0x higher chance of predicting that a random premise falsely Entails the hypothesis when they already predict the hypothesis is veracious. We further investigate the impact of such hallucination on NLI performance in §8.

This behavior is observed across model families (LLaMA, GPT, and PaLM), establishing that it is due to pretraining rather than instruction-tuning or RLHF, since LLaMA and PaLM have only undergone pretraining. This behavior is undesirable, because model predictions on NLI tasks should be based solely on general language understanding, and not on prior knowledge. We may conclude that memory of statements in training data is a significant contributor to LLM inferences, and may be an important source of LLM hallucination.

5.2 Implications for Real Applications

Using prior knowledge as part of language inference has bad implications for the use of LLMs in real applications. We offer an example scenario of a question-answering task where user questions are answered from a Knowledge Base (KB). In typical formulations of this task, if a statement in the KB (premise) entails a user query (hypothesis), the premise may be formulated into an answer. Consider a KB such as a legal document or HR rulebook. Assume that the text is prepended to the user query and presented to the LLM, as in other works (Srinivasan et al., 2022). Given our findings, we might observe the LLM hallucinate answers to questions using information which is not presented in the KB, but may have been read by the LLM in text from other sources during pretraining. These answers could be illogical or contradictory, could misrepresent the views of the KB, or could cause other harms. Such poor use of in-context learning has already been observed in specific domains like medicine (Jimenez Gutierrez et al., 2022).

In general, this is a risk for LLMs which (a) are deployed for tasks like QA by feeding novel text (e.g. a legal document) in-context as part of the user query, and (b) are trained on datasets which are private or otherwise infeasibly large to read manually, containing many facts and human opinions unknowable to both the user and the modeler.

6 Experiment 2: Entities are Indices to LLM Memories

In §5, we established that propositional memory explains a significant portion of false positives in LLM inference predictions. In this section, we continue by showing the importance of entity IDs in the process of LLMs' memory recall.

As described in §3.3, we transform the Levy/Holt dataset into the I_TA task with typed identifiers and two I_RA tasks with type-constrained random entities: the random-infrequent task I_RA↓ and the random-frequent task I_RA↑ (Table 3 shows examples).

By replacing only the entities with others of the same types, entailment labels do not change; however, the new samples should contain novel strings which are not attested in the training data. We expect that an ideal model, capable of generalizing predicate inference, would maintain its predictions across all conditions; on the other hand, flawed models utilizing the veracity prior would predict fewer Entail labels, since entity IDs no longer identify statements from training.
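The sketch below illustrates the argument-replacement transformations just described. It is schematic only: the sample format, the typing scheme, and the frequency-binned entity pools are placeholders standing in for the actual NewsCrawl-derived resources of §3.3.

```python
import random

def replace_arguments(sample, mode, entity_pools=None):
    """Rewrite a Levy/Holt-style sample by swapping its arguments.

    `sample` is assumed to look like:
      {"args": [("India", "location"), ("rice", "food")],
       "premise": "India exports tons of rice",
       "hypothesis": "India exports rice"}
    `mode` is "type" (I_TA), "rare" (I_RA, infrequent) or "frequent" (I_RA, frequent).
    `entity_pools` maps (entity type, mode) to entities drawn from the 5%
    least / most frequent entities of that type.
    The entailment label is unchanged by construction.
    """
    new_sample = dict(sample)
    for i, (surface, ent_type) in enumerate(sample["args"]):
        if mode == "type":
            # Typed identifier such as "location X" / "food Y"
            replacement = f"{ent_type} {chr(ord('X') + i)}"
        else:
            # Type-constrained random real entity
            replacement = random.choice(entity_pools[(ent_type, mode)])
        for field in ("premise", "hypothesis"):
            new_sample[field] = new_sample[field].replace(surface, replacement)
    return new_sample
```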
Task Dev Sample
I (True) India
India rice ⇒ India
India exports tons of rice
rice India
India exports rice
rice
rice
IT A (True) location
location
location X
X
X exports tons of food
food Y ⇒ location
food Y
Y location
location XX
X exports food
food
food Y
Y
Y
IRA ↓ (True) Sloterdijk
Sloterdijk exports tons of oatmeal
Sloterdijk cookies ⇒ Sloterdijk
oatmeal cookies
oatmeal cookies Sloterdijk exports oatmeal
Sloterdijk oatmeal cookies
oatmeal cookies
cookies
IRA ↑ (True) Helsinki
Helsinki
Helsinki exports tons of Granny
Granny Smith ⇒ Helsinki
Granny Smith
Smith Helsinki
Helsinki exports Granny
Granny
Granny Smith
Smith
Smith
Table 3: An original dev sample (I) is transformed by insertion of entity types (IT A ), real entities sampled uniform
randomly from the 5% least frequent entities mentioned in NewsCrawl, constrained to the same entity type (IRA ↓),
and the same, from the 5% most frequent (IRA ↑). Sample format: (Label) [premise] ⇒ [hypothesis].
Levy/Holt (Directional)
Model                 Task    Precision  Recall  ∆-Recall
LLaMA-65B             I       67.0       68.4    0
                      I_TA    69.0       66.9    -1.5
                      I_RA↓   64.0       63.8    -4.6
                      I_RA↑   67.2       53.7    -14.7
GPT-3.5               I       62.4       92.3    0
(text-davinci-003)    I_TA    65.1       75.7    -16.6
                      I_RA↓   65.5       66.5    -25.8
                      I_RA↑   68.8       55.3    -37.0
PaLM-540B             I       72.8       76.2    0
                      I_TA    79.8       50.8    -25.4
                      I_RA↓   69.5       58.7    -17.5
                      I_RA↑   70.8       52.4    -23.8

Table 4: Scoring model outputs in different argument-replacement tasks. We indicate the highest and lowest recall scores across replacement settings, and note that recall decreases sharply across settings in all models.
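The scores in Table 4 are ordinary precision and recall over binary Entail decisions, with ∆-Recall measured against the original task I. A minimal sketch follows, assuming parallel lists of gold and predicted string labels; the format and helper names are illustrative rather than the released code.

```python
from sklearn.metrics import precision_score, recall_score

def score_condition(gold, predicted):
    """Precision/recall of the positive (Entail) class for one task condition."""
    return (precision_score(gold, predicted, pos_label="Entail"),
            recall_score(gold, predicted, pos_label="Entail"))

def delta_recall(gold, pred_original, pred_replaced):
    """Recall change of a replacement condition (I_TA, I_RA) relative to I."""
    _, recall_orig = score_condition(gold, pred_original)
    _, recall_new = score_condition(gold, pred_replaced)
    return recall_new - recall_orig
```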
6.1 Results

We run the models on each dataset condition and report results in Table 4. We notice two important phenomena across all three models, aligning with our hypothesis of a flawed model.

First, we observe that all models' behavior significantly changes in the same way when original entities are replaced by either entity types or random real entities. Despite similar (or marginally increasing) precision across conditions, recall degrades drastically from original entities (I) (GPT-3.5 @92.3) to random frequent entities (I_RA↑) (GPT-3.5 @55.3). Type-placeholder I_TA performance also degrades in this way, showing that this is not a matter of poorly selected real entities, but rather a loss of information from the original dataset that models were using to answer questions.

Second, we observe a significant difference in performance between the two real entity conditions I_RA↓ and I_RA↑, which are both composed of unattested statements, but which contain entities that differ in typical corpus frequency. Infrequent entities (I_RA↓) yield better generalization and a higher recall score (GPT-3.5 @66.5) than frequent entities (I_RA↑) (GPT-3.5 @55.3).

These findings corroborate those from §5, that LLMs use memory as part of inference, and additionally show that these memories are recalled using the identity of entities acting as indices. These experiments demonstrate that too much prior exposure to an entity may interfere with model generalization when that entity is discussed in novel inferences: the more a model has read about an entity during pretraining, the less capable it is of drawing novel natural language inferences involving it, even though those inferences do not require detailed knowledge of the entity ID.

As in §5, we observe a consistent effect across model families, indicating its root in LLM pretraining. We also tried explicitly instructing LLMs to ignore the veracity of individual statements but did not see significant improvement (see Appendix B).

7 Experiment 3: Backoff to Relative Frequency Heuristic

Our natural response to the veracity prior is to test model capabilities while blocking the effects of memory. Following from §6, we apply a further type-argument transformation to I_RP, yielding the I_RP_TA task.
Predictor                          LLaMA-65B  GPT-3.5  PaLM-540B
P(I_RP_TA = Entail | F = Win)      26.9       23.8     11.5
P(I_RP_TA = Entail | F = Lose)     18.0       14.0     5.8
P(I_RP = Entail | F = Win)         32.2       40.8     31.4
P(I_RP = Entail | F = Lose)        28.4       29.6     24.9

Table 5: We estimate the probability of predicting that a randomized premise entails the hypothesis, either with original arguments (I_RP) or typed identifiers (I_RP_TA), conditioned on the relative frequency of the pair of lemmatized predicates (F). In both I_RP and I_RP_TA, all judgements of Entail are false positives (hallucinations).
With the identities of arguments masked, statements cannot be recalled about specific entities to use on this task; additionally, similarly to the I_RP condition, the ground-truth label of each I_RP_TA sample remains No-Entail, so all model predictions of Entail are false positives.

We verify that entity-based memory has been blocked by running an analogous type-argument variant of the V task in §3.1 (V_TA), in which we directly query the LM about the predicted veracity of a hypothesis, with entities replaced by typed identifiers. For GPT-3.5, only 2 hypotheses in the V_TA task are predicted as veracious; for LLaMA, the number is also only 111 / 1,784. This verifies that entity-specific memories are effectively blocked.

After blocking memory, we observe that the average probability of false positives among the LLMs drops from 31.3% in the I_RP condition to 16.6% in the I_RP_TA condition. However, despite the encouraging general statistics, we also observe the emergence of another factor in LLM predictions, a relative frequency heuristic. We first calculate an attribute F for each Levy/Holt sample by querying Google N-grams as in §3.1. F labels the conformance of the sample predicates to this heuristic. 853 samples are F = Win (hypothesis predicate estimated to be at least 5x more corpus-frequent than the premise), and 550 samples are F = Lose (at least 5x less frequent).
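The F attribute is a simple ratio test over the estimated corpus frequencies of the two lemmatized predicates. A sketch of the thresholding logic follows; the frequency values themselves stand in for the Google N-grams lookups of §3.1, and the function name is illustrative.

```python
def relative_frequency_label(premise_freq, hyp_freq, margin=5.0):
    """Label a premise/hypothesis predicate pair for the frequency heuristic.

    Returns "Win" if the hypothesis predicate is estimated to be at least
    `margin` times more corpus-frequent than the premise predicate, "Lose"
    if it is at least `margin` times less frequent, and None for pairs
    falling in between (which receive neither label).
    """
    if hyp_freq >= margin * premise_freq:
        return "Win"
    if hyp_freq * margin <= premise_freq:
        return "Lose"
    return None
```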
We run a similar experiment to §5 by calculating the model probabilities of reporting an entailment in the I_RP and I_RP_TA tasks, conditioned on whether F is Win or Lose, shown in Table 5.

7.1 Results

We observe that the relative frequency heuristic (F) is also a strong predictor of false positive rates (hallucinations) in I_RP_TA, with a separation of 1.5x, 1.7x, and 2.0x for LLaMA, GPT-3.5, and PaLM respectively. When samples conform to the heuristic, models are more likely to report an entailment, even though no semantic relation exists between premise and hypothesis. Again, the effect is consistent across model families, revealing its root in the large-scale pre-training process, rather than model peculiarities or fine-tuning.

The I_RP results show that the relative frequency heuristic has a weaker effect when entity-based memories are available. This indicates a tension between V and F: memory may be used when available, and if not, the predicate pairing may be attended to more closely.

8 Impact of Bias on Performance

We have demonstrated two sources of hallucination by LLMs on inference tasks. We now assess their impact on model performance to quantify the risks of such hallucinations.

We compare LLMs' performance between subsets of the Levy/Holt dataset that are consistent with or adversarial to each factor. An entry P ⊨ H? is consistent with a factor when the prediction implied by the factor is the same as the gold entailment label; conversely, it is adversarial to a factor when the prediction implied by the factor disagrees with the label. For this we again use the predictions from models' veracity priors (V) and the relative frequency heuristic (F). Subset statistics are in Table 6.

While earlier experiments scored model textual responses to characterize behavior change, we now use area under the precision-recall curve (AUC) to summarize model performance over a tunable confidence threshold (scoring described in §4.2), which is better for measuring practical discriminative power. Following Li et al. (2022), we re-scale AUC values to normalize over the label distribution, yielding AUC_norm values which assign random classifiers 0% and perfect classifiers 100%.
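One way to realize such a rescaling, sketched below, subtracts the random-classifier baseline of a precision-recall curve, which is approximately the positive-class prevalence. The exact normalization we use follows Li et al. (2022), so this snippet should be read as an approximation under that assumption rather than the definitive formula.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def auc_norm(gold, scores):
    """Rescaled area under the precision-recall curve.

    `gold` is a 0/1 array of entailment labels and `scores` the model's
    confidence for Entail. A random classifier's PR-AUC is approximately
    the positive-class prevalence, so rescaling maps a random classifier
    to 0 and a perfect classifier to 1. (Sketch only.)
    """
    precision, recall, _ = precision_recall_curve(gold, scores)
    raw_auc = auc(recall, precision)
    prevalence = np.mean(gold)  # fraction of positive labels
    return (raw_auc - prevalence) / (1.0 - prevalence)
```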
Data Subset     Criteria                                           # of Entries (LLaMA / GPT-3.5 / PaLM)
V_CONSISTENT    (G = True ∧ V = True) ∨ (G = False ∧ V = False)    955 / 947 / 999
V_ADVERSARIAL   (G = True ∧ V = False) ∨ (G = False ∧ V = True)    829 / 837 / 785
F_CONSISTENT    (G = True ∧ F = Win) ∨ (G = False ∧ F = Lose)      1,134
F_ADVERSARIAL   (G = True ∧ F = Lose) ∨ (G = False ∧ F = Win)      298

Table 6: Subsets defined by G (entailment label) with either V (hypothesis veracity prediction from each LLM) or F (model-agnostic relative frequency heuristic). CONSISTENT subsets align G with V/F; ADVERSARIAL subsets misalign G with V/F.
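The subset criteria in Table 6 translate directly into a partition over samples. A minimal sketch, assuming each sample carries boolean fields for the gold label G and for the factor prediction (V = True, or F mapped to Win = True); the field names are illustrative.

```python
def split_by_factor(samples):
    """Split samples into factor-consistent and factor-adversarial subsets.

    Each sample is assumed to carry `gold` (True if the gold label is Entail)
    and `factor` (True if the factor, V or F, would predict Entail).
    A sample is consistent exactly when the factor agrees with the gold label.
    """
    consistent, adversarial = [], []
    for s in samples:
        (consistent if s["factor"] == s["gold"] else adversarial).append(s)
    return consistent, adversarial
```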
Levy/Holt
Model        Task   V_C    V_A    diff.   F_C    F_A    diff.
LLaMA-65B    I      65.5   8.1    -57.4   41.4   34.3   -7.1
GPT-3.5      I      85.0   10.8   -74.2   52.6   41.0   -11.6
PaLM-540B    I      79.1   31.5   -47.6   62.3   51.7   -10.6
LLaMA-65B    I_TA   52.1   34.4   -17.7   53.1   37.7   -15.4
GPT-3.5      I_TA   67.1   18.8   -48.3   52.2   36.2   -16.0
PaLM-540B    I_TA   58.1   46.6   -11.5   58.2   44.8   -13.4

Table 7: LLM performance on subsets where V/F is Consistent/Adversarial to gold labels, measured with AUC_norm (0% = random chance performance). Decreases from the V_C/F_C to the V_A/F_A subsets are presented in the diff. columns.
8.1 Results

We report results in Table 7. Under the standard inference task I, the performance drop from V_CONSISTENT (V_C) to V_ADVERSARIAL (V_A) is severe for all 3 LLMs: they deteriorate from very good classifiers to poor or even near-random ones. This fragility from the veracity prior can be alleviated by masking entity IDs with type-identifiers (condition I_TA), which reduces the performance drop.

On the other hand, with the type replacements in I_TA, LLMs are forced to focus on the predicates in each proposition. As a result, the impact of the relative frequency heuristic is intensified: from the standard inference task I to I_TA, the performance gap between the F_CONSISTENT (F_C) and F_ADVERSARIAL (F_A) subsets widens for all LLMs. The differences are generally less dramatic in F-consistency subsets than in V-consistency subsets, partly because the relative frequency heuristic involves pairs of predicates and may be more difficult for LLMs to capture, and potentially because frequency measures with lemmatized predicates and Google N-grams are only a crude estimate of the actual frequencies in each LLM's pre-training corpus.

We note that for V-consistency comparisons, the difference in performance between entries with aligned and misaligned V predictions could be influenced by model-specific idiosyncrasies such as patterns in syntax or vocabulary. We therefore polled all three LLMs for V predictions to acquire a majority vote Ṽ on sample veracity. Conditioning on Ṽ, we find consistency with Table 7 and sometimes larger effects, confirming that noise in V predictions has marginal impact on the comparisons. We show details in Appendix Table 11.

Between pre-trained and instruction-tuned models, the trends are the same. This suggests that current LLM fine-tuning approaches either fail to correct these behaviors or have overlooked them. We call for attention to these behaviors to further narrow the residual performance gaps, and to improve model robustness when reasoning about language.

9 Conclusion

Across several major LLM families and experimental settings, we demonstrate two important factors in the performance of LLMs on natural language inference tasks, which may also manifest in applied tasks as hallucination. Contrary to claims of LLM general reasoning capabilities, we show that much of this performance is achieved by (1) recall of relevant memorizations and (2) corpus-based heuristics like term frequency. Since these factors are reproduced in all families, we establish that they originate in model pretraining.

We conclude that LLMs, though powerful, use unsatisfactory tools for the basic faculties of language understanding and inference. We propose preliminary approaches to help alleviate these problems, but also argue that LLMs must make more progress before they can be relied on to reason in ways analogous to human beings.
Limitations

In this paper, we have discussed two prominent sources of hallucination for LLMs when performing natural language inference. We acknowledge that this is not an exhaustive search of all the sources, and further exploration could be done as future work.

We also note that after ablating the factors discussed in this paper, there remains residual, unexplained performance on NLI tasks. This residual could be attributed to undiscovered biases or to generalising inference capability. We leave the analysis of this residual to future work.

As discussed in Appendix A, we compared a range of popular LLM prompting techniques and selected the most promising approach. We also acknowledge that there could potentially be other novel prompting techniques that could help LLMs resist the influence of the priors discussed in this paper. We identify this as an open question and advocate for future research.

Ethics Statement

This paper discusses two major sources of hallucination in LLM output when asked to perform natural language inference, which we note is a capability required of many downstream tasks such as summarization, question answering, etc. We show that users of LLMs may be subjected to faulty judgements if the content of their request overlaps with data in pretraining. However, it is difficult for both a user and a modeler to ascertain exactly what is contained in pretraining data, or how this will interact with a user's query. Our proposed veracity prior shows promise in detecting potential overlaps, but model responses in applications of these cases are not explored. Further, the relative frequency prior demonstrates a much more subtle problem of corpus distribution that is naturally inherent to model pretraining.

In light of these, the potential harms of LLM use for drawing natural language inferences may include: offering inaccurate or irrelevant information in response to a user's query, or contradicting information provided in-context with a user's query.

References

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250, New York, NY, USA. Association for Computing Machinery.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. ArXiv:2303.12712 [cs].

Sharon A. Caraballo and Eugene Charniak. 1999. Determining the specificity of nouns from text. In 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying Memorization Across Neural Language Models. ArXiv:2202.07646 [cs].

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).

Bernal Jimenez Gutierrez, Nikolas McNeal, Clayton Washington, You Chen, Lang Li, Huan Sun, and Yu Su. 2022. Thinking about GPT-3 in-context learning for biomedical IE? Think again. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4497–4512, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics.

Tianyi Li, Mohammad Javad Hosseini, Sabine Weber, and Mark Steedman. 2022. Language Models Are Poor Learners of Directional Inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 903–921, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI'12, pages 94–100. AAAI Press.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv:1907.11692 [cs].

Nick McKenna, Liane Guillou, Mohammad Javad Hosseini, Sander Bijl de Vroe, Mark Johnson, and Mark Steedman. 2021. Multivalent entailment graphs for question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10758–10768, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Nick McKenna and Mark Steedman. 2022. Smoothing entailment graphs with language models. ArXiv:2208.00318 [cs.CL].

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing.

D. B. Nguyen, Johannes Hoffart, M. Theobald, and G. Weikum. 2014. AIDA-light: High-throughput named-entity disambiguation. Volume 1184.

OpenAI. 2023. GPT-4 Technical Report. ArXiv:2303.08774 [cs].

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv:2203.02155 [cs].

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.

Martin Schmitt and Hinrich Schütze. 2021. Language Models for Lexical Inference in Context. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1267–1280, Online. Association for Computational Linguistics.

Krishna Srinivasan, Karthik Raman, Anupam Samanta, Lingrui Liao, Luca Bertelli, and Michael Bendersky. 2022. QUILL: Query intent with large language models using retrieval augmentation and multi-stage distillation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 492–501, Abu Dhabi, UAE. Association for Computational Linguistics.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. 2022. Memorization without overfitting: Analyzing the training dynamics of large language models. In Advances in Neural Information Processing Systems, volume 35, pages 38274–38290. Curran Associates, Inc.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.

Albert Webson and Ellie Pavlick. 2022. Do prompt-based models really understand the meaning of their prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344, Seattle, United States. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. ArXiv:2201.11903 [cs].

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2023. Benchmarking large language models for news summarization.
A Prompt Format Selection

In prompt-based interactions with the LLMs, several types of context information could be added to help models produce accurate and robust predictions. We attend to two design choices in prompt engineering: prompt templates and in-context examples.

Prompt templates are known to have a direct and sometimes decisive impact on LLM behavior. As such, we carefully select a range of clear and concise templates as promising candidates. As discussed in §4.2, we run each template through the dev sets of each dataset, and select the template with the best discriminative power according to AUC scores (similarly to §8). The candidate set of templates includes 3 concise templates we wrote:

1. If [PREMISE], then [HYPOTHESIS].

2. [PREMISE], so [HYPOTHESIS].

3. [PREMISE] entails [HYPOTHESIS].

We also considered the 5 prompt templates used in prior work on LMs for textual entailment (Schmitt and Schütze, 2021):

4. [PREMISE], which means that [HYPOTHESIS].

5. [HYPOTHESIS], because [PREMISE].

6. It is not the case that [HYPOTHESIS], let alone that [PREMISE].

7. [HYPOTHESIS]_NEG, which means that [PREMISE]_NEG.

8. [PREMISE]_NEG, because [HYPOTHESIS]_NEG.

In preliminary experiments with GPT-3.5, we observed that LLMs are not responsive to the 3 contrapositive prompts from Schmitt and Schütze (2021) (templates 6–8), performing at random. We also observed that prompt number 5 from Schmitt and Schütze (2021) consistently underperforms the other 4 templates, so we use the remaining 4 templates (namely, templates 1, 2, 3, and 4) as our final candidate set.

In-context examples have been widely used for interactions with LLMs since the seminal work of Brown et al. (2020). Further, Wei et al. (2022) demonstrated that including chains-of-thought, namely step-by-step explanations, in the in-context examples helps LLMs perform reasoning tasks. On the other hand, Ouyang et al. (2022) suggested that instruction-tuned LLMs are also capable of performing tasks zero-shot, without exposure to any in-context examples.

We compared zero-shot and few-shot prompting in our preliminary experiments with LLaMA and GPT-3.5 on the Levy/Holt directional dev set. Following Touvron et al. (2023), for zero-shot, we prepend a textual description of the task to each test sample; for few-shot, we prepend a minimal 4 examples with explanations. Instantiated prompts in the two settings are demonstrated in Table 8. Here we report the dev set results with the best-performing templates. We found that for LLaMA, the model's zero-shot performance on the Levy/Holt directional dev set is near-random, at 56.6% AUC (random is 50%); with 4 in-context examples, the model begins to exhibit non-trivial behavior, with 65.0% AUC. This is not surprising, since LLaMA is only a pre-trained LLM without any instruction fine-tuning. For GPT-3.5, the performance is still much lower in zero-shot, at 64.5%, compared to 74.6% in few-shot.

As discussed in §4.2, ideally we would like LLMs to have zero-shot natural language abilities readily available for downstream tasks. However, in light of this observation, our primary experiments are conducted in the few-shot setting throughout, in order to reveal the abilities of these LLMs to the fullest.
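For concreteness, the sketch below shows how a premise/hypothesis pair is slotted into the retained candidate templates. The template strings are those listed above; the helper itself is illustrative rather than part of the released code.

```python
TEMPLATES = {
    1: "If {premise}, then {hypothesis}.",
    2: "{premise}, so {hypothesis}.",
    3: "{premise} entails {hypothesis}.",
    4: "{premise}, which means that {hypothesis}.",
}

def instantiate(template_id, premise, hypothesis):
    """Fill one candidate template with a premise/hypothesis pair."""
    return TEMPLATES[template_id].format(premise=premise, hypothesis=hypothesis)

# e.g. instantiate(1, "ephedrine is widely used in medicine",
#                  "ephedrine is used in medicine")
# -> "If ephedrine is widely used in medicine, then ephedrine is used in medicine."
```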
B The Ineffectiveness of Instructing LLMs to Stop Attending to Veracity

In §5 and §6, we showed that entailment predictions from LLMs are strongly biased by their predictions on the veracity of the hypotheses. We wondered whether there are intuitive prompt engineering techniques to steer this behavior away from attending to veracity.

Towards this goal, we experimented with prepending a line of task description to the few-shot prompts in part B of Table 8, explicitly instructing the models to ignore the veracity of individual statements: "Please check the entailments between the following hypothetical statements. Ignore the veracity of these statements."

We replicated the experiments in §5 and §6 with GPT-3.5, but the results show only marginal improvements in model behavior.

In Table 9, we show that instructing GPT-3.5 to ignore veracity does not help narrow the gap between V = True and V = False; instead, the ratios of positive predictions went down by similar amounts, indicating that the model becomes slightly more conservative in predicting positives when instructed to ignore veracity, but not in a principled manner.

Further, as shown in Table 10, despite the explicit instruction, recall still drops at similar scales when arguments are randomly replaced with the same sets of frequent/infrequent replacement entities as before. Since GPT-3.5 is an instruction-finetuned model trained to be responsive to prompts, this failure means that eradicating such biases from model outputs is a difficult task, one that needs further research attention.

C The Reliability of the V Measure and Its Relation to Grounded Veracity

The V-consistency subsets most directly capture the impacts of the veracity prior. However, as discussed in §8.1, these subset separations are based on V predictions from individual models, which can be noisy, subject to model-specific noise such as trigger strings or responses to certain syntactic structures in the hypotheses.

To verify that the performance gaps in V-consistency subsets that we observe in §8.1 come from predicted veracity and not from any of these noise sources, we experiment with another pair of subsets based on grounded veracity instead of predicted veracity.

We use a majority vote among the three independently-trained LLMs to approximate grounded veracity; the approximation is denoted Ṽ. This is because any model-specific idiosyncrasies should not, in general, be shared between LLMs independently trained from different source corpora. Therefore, with the majority vote, we mask these noise sources and acquire sound predictions on the grounded veracity of statements.

Performances of LLMs between Ṽ-consistency subsets are listed in Table 11. Gaps between the Ṽ-consistency subsets that are larger than the corresponding V-consistency gaps are colored red; those narrower than the V-consistency gaps are colored green. It is clear that the gaps are consistent between the V- and Ṽ-consistency experiments, and the gaps are even larger on many occasions. This confirms that the performance gaps in V-consistency experiments can be credited to the veracity prior, rather than to model-specific idiosyncrasies.

It is also to be noted that, since the F-consistency subsets are separated based on the model-agnostic criterion F, model-specific idiosyncrasies are not a problem for F-consistency comparisons.
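The grounded-veracity approximation Ṽ is a simple majority vote over the three models' hypothesis-only V predictions; a minimal sketch follows (function and argument names are illustrative).

```python
def majority_veracity(v_llama, v_gpt, v_palm):
    """Approximate grounded veracity by majority vote over three independent
    LLMs' hypothesis-only veracity predictions (each True or False)."""
    votes = [v_llama, v_gpt, v_palm]
    return sum(votes) >= 2  # True iff at least two models predict veracious
```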
D Impacts of Bias on GPT-4 Performance

GPT-4 (OpenAI, 2023) is a recent strong LLM claiming SOTA performance on various NLP tasks. Due to its closed-source nature and the impossibility of fully tracking the sources of its behaviors, we refrain from reporting results with it in the main content of this paper.

However, in order to provide a richer context for the veracity prior and the relative frequency heuristic, in this section we report the performance differences of GPT-4 between subsets consistent/adversarial to the two factors.

As a light-weight experiment, we elicit GPT-4 predictions in the original I task in the zero-shot setting, and re-use the subsets from the experiments in §8. Specifically, for the veracity prior, we use the majority vote Ṽ among LLaMA, GPT-3.5, and PaLM to approximate V predictions from GPT-4 itself; for the relative frequency heuristic, we keep the F measure for approximating the corpus frequency of terms.

Because GPT-4 is a commercial service and does not provide logit confidence with its discrete predictions, AUC_norm values could not be calculated. Therefore, we are forced to report F1 scores at the binary prediction point of confidence. As the results in Table 12 show, we observe the same trend as in §8: for the subset adversarial to each factor, GPT-4 performance also drops substantially.

This experiment is designed to provide more context for the two factors discussed in the paper and NOT to compare GPT-4 with other models; however, we can conclude that GPT-4 is subject to the same fragilities as the other LLMs w.r.t. the two factors, and our conclusions and recommendations also apply to it.
A. Zero-shot Example Instantiated Prompt
Please check the entailments between the following statements.

Table 8: Example instantiated prompts in the zero-shot / few-shot settings, for the test entry "PREMISE: [ephedrine is widely used in medicine], HYPOTHESIS: [ephedrine is used in medicine]". The few-shot prompts in part B are used throughout the main experiments in this paper. We also present an example of the prompts we use for the hypothesis-only V measure as described in §3.1.
GPT-3.5                          Instructed to Ignore Veracity   Not Instructed
P(I = Entail | V = True)         74.3                            77.6
P(I = Entail | V ≠ True)         57.8                            63.6
P(I_RP = Entail | V = True)      39.0                            41.3
P(I_RP = Entail | V ≠ True)      17.6                            18.8

Table 9: We estimate the probability of positive predictions in the I and I_RP tasks respectively, conditioned on whether the hypothesis is predicted as veracious (V = True) or not. Not-instructed results are borrowed from Table 2 and listed here for ease of comparison; also note that all I_RP = Entail predictions are false positives.
Levy/Holt (Directional)
GPT-3.5 Condition                          Task    Precision  Recall  ∆-Recall
Few-shot, instructed to ignore veracity    I       64.9       90.8    0
                                           I_RA↓   64.6       68.4    -22.4
                                           I_RA↑   67.5       58.1    -32.7
Few-shot, no instructions                  I       62.4       92.3    0
                                           I_RA↓   65.5       66.5    -25.8
                                           I_RA↑   68.8       55.3    -37.0

Table 10: GPT-3.5 predictions when the model is explicitly instructed to avoid taking the veracity of individual statements into account. The upper half shows the instructed behavior, and the lower half shows the regular few-shot behavior as in Table 4. Differences in recall remain at a similar scale, with precision again stable, so the benefit from the explicit instruction is marginal.
Levy/Holt
Model        Task   Ṽ_C    Ṽ_A    diff.
LLaMA-65B    I      65.3   6.5    -58.8
GPT-3.5      I      70.8   23.5   -47.3
PaLM-540B    I      80.7   28.3   -52.4
LLaMA-65B    I_TA   54.4   29.6   -24.8
GPT-3.5      I_TA   56.2   35.5   -20.7
PaLM-540B    I_TA   59.3   40.1   -19.2