
Sources of Hallucination by Large Language Models on Inference Tasks

Nick McKenna†* Tianyi Li†*
Liang Cheng† Mohammad Javad Hosseini‡ Mark Johnson§ Mark Steedman†
† University of Edinburgh ‡ Google Research § Macquarie University
{nick.mckenna, tianyi.li}@ed.ac.uk

arXiv:2305.14552v1 [cs.CL] 23 May 2023

Abstract

Large Language Models (LLMs) are claimed to be capable of Natural Language Inference (NLI), necessary for applied tasks like question answering and summarization, yet this capability is under-explored. We present a series of behavioral studies on several LLM families (LLaMA, GPT-3.5, and PaLM) which probe their behavior using controlled experiments. We establish two factors which predict much of their performance, and propose that these are major sources of hallucination in generative LLMs. First, the most influential factor is memorization of the training data. We show that models falsely label NLI test samples as entailing when the hypothesis is attested in the training text, regardless of the premise. We further show that named entity IDs are used as "indices" to access the memorized data. Second, we show that LLMs exploit a further corpus-based heuristic using the relative frequencies of words. We show that LLMs score significantly worse on NLI test samples which do not conform to these factors than on those which do; we also discuss a tension between the two factors, and a performance trade-off.¹

* Co-first authors with equal contribution.
¹ Code and LLM outputs (from LLaMA and GPT-3.5) are available at https://github.com/Teddy-Li/LLM-NLI-Analysis.

1 Introduction

Large Language Models (LLMs) such as LLaMA, GPT-3/4, and PaLM (Touvron et al., 2023; Brown et al., 2020; Chowdhery et al., 2022) have been trusted by many to perform language understanding in downstream tasks such as summarization, question answering, and fact verification, among others (Zhang et al., 2023). However, due to the large-scale nature of LLM training on vast, often proprietary data, and the inherent opacity of LLM parameters, it is difficult to explain their behavior when answering user queries, and the corresponding risks in terms of bias and robustness. In particular, one LLM behavior poses a significant challenge: "hallucination," the phenomenon in which LLMs provide information which is incorrect or inappropriate, presented in a factual manner.

This paper investigates how LLMs perform on natural language inference tasks, sometimes called textual entailment, a basic capability forming part of language understanding which is used in real tasks. We look at directional entailments, which hold in one direction but not both: for example, DEFEAT entails PLAY but PLAY does not entail DEFEAT. Inferring directional entailment is more difficult than inferring paraphrase, which is symmetric, so it more deeply probes understanding.

Our approach is a behavioral study of prompted LLM decision-making. We alter existing directional inference datasets in targeted ways while measuring how predictions change, across several major LLM families (LLaMA, GPT-3.5, and PaLM). We demonstrate two sources of LLM performance on the directional NLI task, which also explain false positive hallucination: (1) LLM bias toward affirming sample hypotheses that are attested in the training text, including reliance on named entity identifiers; and (2) a corpus-term-frequency heuristic, biased toward test samples with premises less frequent than hypotheses.

We establish that these originate from the LLM pretraining objective, in which statistical modeling of the natural distribution of human-generated text leads to memorizing individual statements (at the level of sentences) and learning typical patterns of usage (at the level of corpora). Though superficially impressive in performance, our experiments show that even powerful LLMs still use unsatisfactory tools instead of human-like reasoning.

We present three contributions: the demonstrations of both factors across several experiments, and an impact analysis:

(1) In a prompting scenario, LLMs respond to NLI test samples according to a veracity bias,
affirming hypotheses more readily if seen in the pretraining text. When a hypothesis proposition is attested in training according to the model itself, LLaMA-65B, GPT-3.5, and PaLM-540B are respectively 1.9, 2.2, and 2.0 times more likely to wrongly predict a false positive inference, compared to when it is not attested. Further, LLMs recall this memorized information using named entities as identifying "indices," even though these are irrelevant to the logic of the predicate inference task.

(2) When relevant memorized text is not available, LLMs back off to a simple corpus-based heuristic using term frequency to make judgements. LLaMA-65B, GPT-3.5, and PaLM-540B are 1.5, 1.7, and 2.0 times more likely to wrongly predict a false positive if the hypothesis has higher relative frequency than the premise than if it does not.

(3) For the NLI subsets consistent with these factors, LLMs appear to be excellent classifiers; for NLI subsets adversarial to them, LLM performance degrades severely. We show that when labels go against the veracity prior, LLMs degrade into poor or even near-random classifiers; for the relative frequency heuristic, we also show a substantial performance decrease, consistently across all the LLMs.

2 Related Work

Prior work has addressed the robustness issues of NLI datasets. Poliak et al. (2018) found a range of NLI datasets to be subject to a hypothesis-only bias, whose labels are predictable by supervised models trained on only the hypothesis. In this paper, we use a similar hypothesis-only test with LLMs, but with few-shot examples, to probe model memory without training.

Li et al. (2022) show that smaller Language Models such as RoBERTa (355M parameters) (Liu et al., 2019), under the pre-train-fine-tune paradigm, are reliant on dataset artifacts when performing directional predicate inference. In this paper, we study the behavior of a class of much larger Language Models, which have previously demonstrated more robust performance across NLP tasks.

Recent work has also explored LLM memorization and generalization. Carlini et al. (2023) establish that LLMs are able to memorize more data than small LMs, whereas Tirumala et al. (2022) further hypothesize that LLMs pay special attention early in training to numbers and nouns, which may act as unique identifiers for individual sentences in training. We further show that entity recall is used in language inference, and that frequency in training data affects generalization in various ways (§6, §7).

Webson and Pavlick (2022) show that LLMs perform surprisingly well in NLP tasks even conditioned on pathologically unhelpful prompts, calling into question whether LLMs understand prompts the same way as humans. In this vein, we show that tasks formatted for language inference may be answered like memory recall tasks instead, which may happen to align with correct labels.

Recently, Bubeck et al. (2023) advocated that GPT-4 has a deep and flexible understanding of language "far beyond memorization". Although we do not disprove the existence of such abilities, we show that GPT-4, like the other LLMs, is also subject to these hallucinations (see Appendix D).

Corpus frequency statistics have long been studied: it is well known that nouns follow a trend of becoming more specific as corpus frequency decreases (Caraballo and Charniak, 1999). McKenna and Steedman (2022) also argued that more specific predicates tend to be less corpus-frequent than more general predicates. Since entailments may carry from a specific predicate to a more general one, e.g. SPRINT entails RUN, relative corpus frequency can be indicative of entailment, though it has no direct relationship to meaning.

3 Experimental Design

We design behavioral experiments on LLMs by modifying an original NLI dataset in various conditions and observing the change in model response. By controlling for each tested factor, we demonstrate significant behavior change across three major LLM families due to memory effects in §5 and §6, and a corpus frequency heuristic in §7. Finally, we show the impact on task performance in §8.

We now define the two priors explored in this paper, as well as the dataset and test conditions in which we evaluate each LLM.

3.1 Veracity and Relative Frequency Priors

The Veracity Prior indicates when a propositional statement is likely to be attested in some way by LLM training data. We measure the attestation of a statement using the LLMs' own response when prompted to predict the veracity of a hypothesis proposition: "[hypothesis]. is this true or false?".²

² One alternative is to use the LLM's perplexity for a given proposition; however, since perplexity is not available with GPT-3 series models, for even evaluation across models we use the veracity predictions instead.
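A minimal Python sketch of such a veracity probe is given below for illustration; query_llm is a hypothetical stand-in for whichever LLM API is queried, and the exact prompt wording follows the description above.

    def veracity_prediction(hypothesis: str, query_llm) -> bool:
        """Ask the LLM whether a bare hypothesis is attested as true (the V prediction).

        query_llm is a hypothetical callable that sends a prompt string to an LLM
        and returns its textual completion.
        """
        prompt = f"{hypothesis}. is this true or false?"
        answer = query_llm(prompt).strip().lower()
        # V = True only when the model explicitly affirms the statement.
        return answer.startswith("true")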
Veracity predictions are denoted with V. As discussed in §2, we draw inspiration from the hypothesis-only baseline (Poliak et al., 2018), but we use the test to probe model memory without training. We describe prompt generation in more detail in §4.2, and appendix Table 8 shows an example.

The Relative Frequency Heuristic is a simple corpus-based heuristic, assigning a label of Entail if the premise is less corpus-frequent than the hypothesis. This heuristic is reflected in the natural distribution of predicates in text due to the anti-correlation of specificity and frequency (more specific terms tend to be less corpus-frequent) and the fact that specific terms may entail more general terms (McKenna and Steedman, 2022). However, this effect has no direct relationship with meaning, and is thus risky to trust.

This heuristic would ideally be measured over each LLM's pre-training corpus; however, these corpora are impractically large and/or proprietary. Instead, we use Google N-grams³ as a proxy of the natural distribution of text, and thus of the distributions of these corpora. We take average frequencies between the years 1950-2019, and compare between the premise P and the hypothesis H.

To account for noise in corpus distributions, we lemmatize each predicate, discard surrounding context words, and require a wide margin of difference between P and H frequencies.

Predictions from this prior are denoted with F: if H is at least 5x more frequent than P, we label it F = win; if H is at least 5x less frequent than P, we label it F = lose; anything in between is a draw and is left out of F-analyses.

³ https://books.google.com/ngrams
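A minimal sketch of this labelling rule is shown below for illustration; ngram_frequency is a hypothetical helper that returns the average Google N-grams frequency of a lemmatized predicate over 1950-2019.

    def relative_frequency_label(premise_pred: str, hypothesis_pred: str,
                                 ngram_frequency, margin: float = 5.0) -> str:
        """Label a premise/hypothesis predicate pair as win / lose / draw.

        ngram_frequency is a hypothetical helper returning the average
        Google N-grams frequency of a lemmatized predicate (1950-2019).
        """
        p_freq = ngram_frequency(premise_pred)
        h_freq = ngram_frequency(hypothesis_pred)
        if h_freq >= margin * p_freq:
            return "win"    # hypothesis at least 5x more frequent than premise
        if h_freq * margin <= p_freq:
            return "lose"   # hypothesis at least 5x less frequent than premise
        return "draw"       # left out of F-analyses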
3.2 Dataset

Levy/Holt For our experiments, we use the Levy/Holt dataset, containing premise-hypothesis pairs with a task formatted: "Given [premise P], is it true that [hypothesis H]?". Each P- and H-statement has the property of containing one predicate with two entity arguments (where the same entities appear in both P and H), as shown in Table 1, so entailment is decidable on the basis of the predicate and its attributes. We study the challenging subset of 1,784 directional questions, where entailments hold in one direction but not both.

We aim to test LLMs on their capability to reason purely about the semantics of natural language predicates, not on any task relating to knowledge of the world, so we explicitly avoid datasets such as MMLU (Hendrycks et al., 2021), Natural Questions (Kwiatkowski et al., 2019), OpenBookQA (Mihaylov et al., 2018), etc.

3.3 Dataset Transformations

The standard inference task I is presented by the Levy/Holt NLI dataset, in which the answer is determinable using only general language inference over predicates and their attributes. As a task over sentences, each sample sentence also contains entity arguments.

We define three dataset transformations designed to remove aspects of information from original samples. We study the change in model behavior as targeted information is removed. We define three new transformed tasks: randomized premise predicate I_RP, type-arguments I_TA, and type-constrained randomized arguments I_RA.

Each transformation involves first identifying the types of entities in statements in order to constrain replacements in a natural way which is purposefully meaning-preserving or -destructive. To type the entities in each statement, we derive their FIGER type (Ling and Weld, 2012), such as "person," "location," "organization," etc. This is done using an entity linker (Nguyen et al., 2014) which identifies an entity's Freebase ID (Bollacker et al., 2008), from which we obtain a type t ∈ T among the 48 FIGER types plus 1 default type "thing" used in failure cases.

The random premise task I_RP replaces the original premise predicate with a random predicate, while maintaining the same entity arguments. This test is designed to break the link between the premise and hypothesis, since a randomly sampled premise predicate is very unlikely to entail the hypothesis in all cases. We aim to test model sensitivity when entities remain the same, but the predicates are no longer semantically related.

To maintain naturalness and grammaticality, we insert a new predicate that satisfies the same entity type-constraints as the old one: the replacement should have argument slots of the same types as the original premise. For example, "[medicine] is indicated for patients with [disease]" is swapped for "[medicine] does not cure [disease]". We map the entities to their respective slots between the original premise and the replacement. I_RP is a good test of generalizing language understanding since new strings are created which we assume the model has not seen before in training. An example is shown in Table 1.

To create the I_RP task, we source candidate premises from the full Levy/Holt development set of 5,486 questions, and sample uniform-randomly from the predicates satisfying the target type-constraints. In this task, entailment labels are all assumed to be No-Entail, since a randomly sampled premise is unlikely to entail the hypothesis.

The type-argument task I_TA replaces the original arguments with unique, typed identifiers, e.g. "location X" or "food Y". In using basic FIGER types to mask the identities of arguments, this test is designed to remove extraneous information while maintaining the same entailment label, as a baseline control setting. We append unique identifiers "X" and "Y" to allow tracking of entity slots across the premise and the hypothesis, in case statements involve two arguments of the same type.

The random argument task I_RA builds off of I_TA by replacing original entities with other real, random entities of the same type. Like I_TA, this test is designed to modify statements without changing entailment truth values, and it tests model sensitivity to novel extraneous information. To this end, we replace the arguments only with entities of the same FIGER types. We use the same mapping across paired samples which test both directions of entailment: forwards (a ⊨ b) and reverse (b ⊭ a). Examples for all argument transformations are shown in Table 3.

I_RA is also designed to create novel strings that are unlikely to be in pre-training data. However, the truth value of dataset entailments is not changed, since sample labels are determinable using only the predicate. The entity type constraints additionally ensure polysemous predicates maintain the same sense. For example, a different sense of run can be read from "[person] runs [organization]" and "[person] runs [software]"; but among different entities of the same types, predicate senses are consistent, so the exact entity IDs do not affect general inference about predicates.

We source new entities from NewsCrawl, a decade-long span of multi-source news text, in which entities are linked and typed as above. This corpus is used and described in other work (McKenna et al., 2021; Li et al., 2022). We draw new entities uniform-randomly from the 5% least frequently mentioned entities in NewsCrawl (I_RA↓), and from the 5% most frequent (I_RA↑). We swap the arguments for sampled entities while preserving the rest of each statement.
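For illustration, the argument transformations above can be sketched as follows. This is a simplified illustration, not the released pipeline: entity spans, their FIGER types, and the frequency-constrained entity pools are assumed to be precomputed.

    import random

    def type_arguments(statement: str, entities: list[tuple[str, str]]) -> str:
        """I_TA: mask each (entity, figer_type) pair with a typed identifier."""
        ids = ["X", "Y"]
        for (entity, figer_type), uid in zip(entities, ids):
            statement = statement.replace(entity, f"{figer_type} {uid}")
        return statement

    def randomize_arguments(statement: str, entities: list[tuple[str, str]],
                            entity_pool: dict[str, list[str]]) -> str:
        """I_RA: swap each entity for a random real entity of the same FIGER type.

        entity_pool maps a FIGER type to candidate replacement entities, e.g. the
        5% least- or most-frequently mentioned NewsCrawl entities of that type.
        """
        for entity, figer_type in entities:
            replacement = random.choice(entity_pool[figer_type])
            statement = statement.replace(entity, replacement)
        return statement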
4 Querying Models with Prompts

We describe our methodology for model selection, prompt development, and model scoring.

4.1 Models

GPT-3 Series Though closed to deep scientific review, these models are a widely-used comparison point due to their performance, and have been reasonably well-studied. We take text-davinci-003 (GPT-3.5) as the primary subject of evaluation (Brown et al., 2020), as it is the largest and best-aligned version.

LLaMA A recent LLM family which rivals or surpasses GPT-3 performance while being open to scientific study. LLaMA provides a range of model sizes; we test with the largest 65B model. LLaMA is not fine-tuned; while there have been efforts to fine-tune LLaMA models for alignment with humans (Taori et al., 2023; Chiang et al., 2023), we did not find these to be significantly different from the original on our task, so we leave them out.

PaLM One of the largest available LLM families; we test with the largest 540-billion-parameter model, which often claims state-of-the-art on evaluation datasets (Chowdhery et al., 2022). As it is only pretrained, this model serves as a further comparison point to LLaMA.

Later GPT models such as text-davinci-003, used in our experiments, have been trained in several phases including pretraining, instruction-tuning, and human alignment via Reinforcement Learning from Human Feedback (RLHF), while base LLaMA and PaLM models have only undergone pretraining, so their contrast indicates what stage of training is responsible for observed phenomena. Finally, we omit experimenting on open models superseded in performance by LLaMA, such as OPT, GPT-J, etc., and also on models which are closed to scientific review, such as GPT-4, Bard, etc.⁴

⁴ In Appendix D, we also report the impact of the two factors on GPT-4, the LLM receiving the most spotlight; we show that GPT-4 shares the same fragilities as other LLMs.

4.2 Prompt Design and Evaluation

Formatting We feed each test sample into the model by inserting the premise and hypothesis into a prompt template, which is used to query the model in natural language. Following this, we append a three-way answer choice: A) Entailment, B) Neutral, C) Contradiction, following the typical format in NLI (Bowman et al., 2015).

Tuning We tune our prompt templates on the Levy/Holt dev set. We use a set of the 4 most promising prompt templates, including the best template from Schmitt and Schütze (2021), also used in other NLI work⁵ (Webson and Pavlick, 2022). We choose the best-performing prompt template on the dev set to deploy on the test set.

⁵ See Appendix A for the full list of prompt templates.

Ideally, an LLM with advanced language understanding capability could perform inference zero-shot, without any annotated examples, which would raise confidence that this faculty is readily available in downstream tasks such as summarization or QA. To this end, we test GPT-3.5 and LLaMA zero-shot on the datasets, but they exhibit severely degraded performance, even near random chance. We turn to few-shot, and hand-annotate a minimal 4 examples in the style of the template, with added explanations about why the given answer is correct for each example. These examples are prepended before the query (see Appendix A for an example prompt). While the definition of "few-shot" varies between works, we seek to use the minimum number of examples necessary to evoke a positive response on the dataset for each model; our goal is to study model behavior as conditions change, not to maximize the score on any particular dataset. We observe that given 4 high-quality annotated examples, each model is able to score very well on the dev portion, across most templates.

Scoring We convert choice A into Entail and collapse both B and C choices into No-Entail to align with the Levy/Holt annotation. For experiments in §5, §6, and §7, we score the model solely based on its textual response. Inspection of dev responses shows the models are compatible with the QA format and respond using A/B/C 100% of the time.

For experiments in §8, which measure practical model performance across confidence thresholds, we convert the letter choice to an "entailment score" with the mapping:

    S_ent = 0.5 + 0.5 · 1[tok = A] · S_tok − 0.5 · 1[tok ∈ {B, C}] · S_tok

where 1 is the indicator function, and S_ent estimates the probability of positive classification (Entail) from a textual output (0 ≤ S_ent ≤ 1) with token probability S_tok, using a linear transformation which preserves the ordering of model confidences. This is sufficient for calculating a precision-recall curve, so full probability distributions over all answer tokens (which are not always provided, e.g. by GPT) are not necessary.

5 Experiment 1: Dominance of Veracity Prior

We first assess LLMs' reliance on memorization of training text by conditioning each model's entailment task predictions on its own predictions for hypothesis veracity, which measure whether a hypothesis is attested in training data. We examine two scenarios: the standard inference task (I) and the random premise task (I_RP). We compare model probabilities of predicting Entail conditioned on whether the hypothesis veracity is predicted as True or not, in order to evaluate the model outputs' dependence on the veracity prior.

We further control for the possibility that original Levy/Holt entailments may coincidentally refer to attested facts, which could lead to spurious correlation between inference and veracity scores without clearly demonstrating use of memory versus entailment capability. We apply this control by comparing to the random premise task I_RP, which converts entailments into non-entailments without altering the hypothesis. An ideal model capable of language understanding using the information provided in context should detect that in the I_RP task it is no longer possible to infer the hypothesis based on the premise (even if the hypothesis is itself attested in training), and should never predict Entail. Thus, in the I_RP task, all Entail predictions are assumed to be false positive hallucinations.

5.1 Results

With I, I_RP and V predictions acquired as described in §3.1, we present the results in Table 2. It is clear that a model's memory of the hypothesis plays a part in its predictions when conditioned on a premise, either related or random.

For I, we observe a significantly increased probability of predicting Entail when the hypothesis is previously judged veracious.

In the random premise task I_RP, this trend continues.
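The letter-to-score mapping of §4.2 corresponds to the following sketch (illustrative only), where token_prob is the probability S_tok that the model assigned to its chosen answer letter.

    def entailment_score(answer_letter: str, token_prob: float) -> float:
        """Map an A/B/C answer and its token probability S_tok to S_ent in [0, 1].

        Choice A (Entailment) maps above 0.5; choices B/C (Neutral / Contradiction)
        map below 0.5. The ordering of model confidences is preserved.
        """
        if answer_letter == "A":
            return 0.5 + 0.5 * token_prob
        # Models respond with A/B/C, so any non-A answer is treated as B or C here.
        return 0.5 - 0.5 * token_prob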
Task | Dev Sample
I (True) | George Bush was the Governor of Texas ⇒ George Bush is a politician from Texas
I_RP (False) | George Bush resided in Texas ⇒ George Bush is a politician from Texas

Table 1: From the original dataset task (I) we derive the Random Premise task (I_RP), where the premise predicate has been randomized while respecting type-constraints. A random premise predicate is highly unlikely to entail the hypothesis, so all labels become False. Sample format: (Label) [premise] ⇒ [hypothesis].

Predictor | LLaMA-65B | GPT-3.5 | PaLM-540B
P(I = Entail | V = True) | 63.6 | 77.6 | 67.9
P(I = Entail | V ≠ True) | 37.1 | 63.6 | 41.2
P(I_RP = Entail | V = True) | 39.7 | 41.3 | 39.9
P(I_RP = Entail | V ≠ True) | 20.7 | 18.8 | 19.9

Table 2: We estimate the probabilities of predicting an entailment in the original task (I) and random premise task (I_RP), conditioned on the model's own judgement of hypothesis veracity (V). In I_RP, all judgements of Entail are false positives (hallucinations).
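The conditional rates in Table 2 can be estimated from paired predictions as in the sketch below; the record field names are illustrative.

    def conditional_entail_rate(samples: list[dict], veracity_value: bool) -> float:
        """Estimate P(task = Entail | V = veracity_value) over prediction records.

        Each record is assumed to hold a boolean "entail" (task prediction)
        and a boolean "veracious" (the model's own V prediction).
        """
        subset = [s for s in samples if s["veracious"] == veracity_value]
        if not subset:
            return 0.0
        return sum(s["entail"] for s in subset) / len(subset)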

LLaMA, GPT-3.5, and PaLM respectively show a 1.9x, 2.2x, and 2.0x higher chance of predicting that a random premise falsely Entails the hypothesis if they already predict the hypothesis to be veracious. We further investigate the impact of such hallucination on NLI performance in §8.

This behavior is observed across model families (LLaMA, GPT, and PaLM), establishing that it is due to pretraining rather than instruction-tuning or RLHF, since LLaMA and PaLM have only undergone pretraining. This behavior is undesirable, because model predictions on NLI tasks should be based solely on general language understanding, with no prior knowledge. We may conclude that memory of statements in training data is a significant contributor to LLM inferences, and may be an important source of LLM hallucination.

5.2 Implications for Real Applications

Using prior knowledge as part of language inference has bad implications for the use of LLMs in real applications. We offer an example scenario of a question-answering task where user questions are answered from a Knowledge Base (KB). In typical formulations of this task, if a statement in the KB (premise) entails a user query (hypothesis), the premise may be formulated into an answer. Consider a KB such as a legal document or HR rulebook. Assume that the text is prepended to the user query and presented to the LLM, as in other works (Srinivasan et al., 2022). Given our findings, we might observe the LLM hallucinate answers to questions using information which is not presented in the KB, but may have been read by the LLM in text from other sources during pretraining. These answers could be illogical, contradictory, and could misrepresent the views of the KB, among other harms. Such poor use of in-context learning has already been observed in specific domains like medicine (Jimenez Gutierrez et al., 2022).

In general, this is a risk for LLMs which (a) are deployed for tasks like QA by feeding novel text (e.g. a legal document) in-context as part of the user query, and (b) are trained on datasets which are private or otherwise infeasibly large to read manually, containing many facts and human opinions unknowable to both the user and modeler.

6 Experiment 2: Entities are Indices to LLM Memories

In §5, we established that propositional memory explains a significant portion of false positives in LLM inference predictions. In this section, we continue by showing the importance of entity IDs in the process of LLMs' memory recall.

As described in §3.3, we transform the Levy/Holt dataset into the I_TA task with typed identifiers and two I_RA tasks with type-constrained random entities: the random-infrequent task I_RA↓ and the random-frequent task I_RA↑ (Table 3 shows examples).

By replacing only the entities with others of the same types, entailment labels do not change; however, the new samples should contain novel strings which are not attested in the training data.
We expect that an ideal model, capable of generalizing predicate inference, would maintain its predictions across all conditions; on the other hand, flawed models utilizing the veracity prior would predict fewer Entail labels, since entity IDs no longer identify statements from training.

Task | Dev Sample
I (True) | India exports tons of rice ⇒ India exports rice
I_TA (True) | location X exports tons of food Y ⇒ location X exports food Y
I_RA↓ (True) | Sloterdijk exports tons of oatmeal cookies ⇒ Sloterdijk exports oatmeal cookies
I_RA↑ (True) | Helsinki exports tons of Granny Smith ⇒ Helsinki exports Granny Smith

Table 3: An original dev sample (I) is transformed by insertion of entity types (I_TA), real entities sampled uniform-randomly from the 5% least frequent entities mentioned in NewsCrawl, constrained to the same entity type (I_RA↓), and the same from the 5% most frequent (I_RA↑). Sample format: (Label) [premise] ⇒ [hypothesis].

Levy/Holt (Directional)
Model | Task | Precision | Recall | ∆-Recall
LLaMA-65B | I | 67.0 | 68.4 | 0
LLaMA-65B | I_TA | 69.0 | 66.9 | -1.5
LLaMA-65B | I_RA↓ | 64.0 | 63.8 | -4.6
LLaMA-65B | I_RA↑ | 67.2 | 53.7 | -14.7
GPT-3.5 (text-davinci-003) | I | 62.4 | 92.3 | 0
GPT-3.5 (text-davinci-003) | I_TA | 65.1 | 75.7 | -16.6
GPT-3.5 (text-davinci-003) | I_RA↓ | 65.5 | 66.5 | -25.8
GPT-3.5 (text-davinci-003) | I_RA↑ | 68.8 | 55.3 | -37.0
PaLM-540B | I | 72.8 | 76.2 | 0
PaLM-540B | I_TA | 79.8 | 50.8 | -25.4
PaLM-540B | I_RA↓ | 69.5 | 58.7 | -17.5
PaLM-540B | I_RA↑ | 70.8 | 52.4 | -23.8

Table 4: Scoring model outputs in different argument-replacement tasks. We indicate the highest and lowest recall score across replacement settings, and note that recall decreases sharply across settings in all models.
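One way to construct the I_RA↓ and I_RA↑ entity pools used in Table 3 is sketched below. This is a simplification: as described in §3.3, candidates are additionally constrained to the same FIGER type as the entity they replace; entity mention counts from NewsCrawl are assumed to be available.

    def frequency_pools(entity_counts: dict[str, int], fraction: float = 0.05):
        """Split entities into the 5% least and 5% most frequently mentioned.

        entity_counts maps an entity name to its NewsCrawl mention count.
        Returns (infrequent_pool, frequent_pool) for I_RA-down / I_RA-up sampling.
        """
        ranked = sorted(entity_counts, key=entity_counts.get)
        k = max(1, int(len(ranked) * fraction))
        return ranked[:k], ranked[-k:]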

6.1 Results

We run the models on each dataset condition and report results in Table 4. We notice two important phenomena across all three models, aligning with our hypothesis of a flawed model.

First, we observe that all models' behavior changes significantly, and in the same way, when original entities are replaced by either entity types or random real entities. Despite similar (or marginally increasing) precision across conditions, recall degrades drastically from original entities (I) (GPT-3.5 @92.3) to random frequent entities (I_RA↑) (GPT-3.5 @55.3). Type-placeholder I_TA performance also degrades in this way, showing that this is not a matter of poorly selected real entities, but rather a loss of information from the original dataset that models were using to answer questions.

Second, we observe a significant difference in performance between the two real entity conditions I_RA↓ and I_RA↑, which are both composed of unattested statements, but which contain entities that differ in typical corpus frequency. Infrequent entities (I_RA↓) yield better generalization and a higher recall score (GPT-3.5 @66.5) than frequent entities (I_RA↑) (GPT-3.5 @55.3).

These findings corroborate those from §5, that LLMs use memory as part of inference, and additionally show that these memories are recalled using the identities of entities acting as indices. These experiments demonstrate that too much prior exposure to an entity may interfere with model generalization when that entity is discussed in novel inferences: the more a model has read about an entity during pretraining, the less capable it is of drawing novel natural language inferences involving it, even though those inferences do not require detailed knowledge of the entity ID.

As in §5, we observe a consistent effect across model families, indicating its root in LLM pretraining. We also tried explicitly instructing LLMs to ignore the veracity of individual statements, but did not see significant improvement (see Appendix B).

7 Experiment 3: Backoff to Relative Frequency Heuristic

Our natural response to the veracity prior is to test model capabilities while blocking the effects of memory. Following from §6, we apply a further type-argument transformation to I_RP, yielding the I_RP_TA task.
Predictor | LLaMA-65B | GPT-3.5 | PaLM-540B
P(I_RP_TA = Entail | F = Win) | 26.9 | 23.8 | 11.5
P(I_RP_TA = Entail | F = Lose) | 18.0 | 14.0 | 5.8
P(I_RP = Entail | F = Win) | 32.2 | 40.8 | 31.4
P(I_RP = Entail | F = Lose) | 28.4 | 29.6 | 24.9

Table 5: We estimate the probability of predicting that a randomized premise entails the hypothesis, either with original arguments (I_RP) or typed identifiers (I_RP_TA), conditioned on the relative frequency of the pair of lemmatized predicates (F). In both I_RP and I_RP_TA, all judgements of Entail are false positives (hallucinations).

With the identities of arguments masked, statements cannot be recalled about specific entities for use on this task; additionally, similarly to the I_RP condition, the ground-truth label of each I_RP_TA sample remains No-Entail, so all model predictions of Entail are false positives.

We verify that entity-based memory has been blocked by running an analogous type-argument variant of the V task in §3.1 (V_TA), in which we directly query the LM about the predicted veracity of a hypothesis, with entities replaced by typed identifiers. For GPT-3.5, only 2 hypotheses in the V_TA task are predicted as veracious; for LLaMA, the number is also only 111 / 1,784. This verifies that entity-specific memories are effectively blocked.

After blocking memory, we observe that the average probability of false positives among the LLMs drops from 31.3% in the I_RP condition to 16.6% in the I_RP_TA condition. However, despite the encouraging general statistics, we also observe the emergence of another factor in LLM predictions, a relative frequency heuristic. We first calculate an attribute F for each Levy/Holt sample by querying Google N-grams as in §3.1. F labels the conformance of the sample predicates to this heuristic. 853 samples are F = Win (hypothesis predicate estimated to be at least 5x more corpus-frequent than the premise), and 550 samples are F = Lose (at least 5x less frequent).

We run a similar experiment to §5 by calculating the model probabilities of reporting an entailment in the I_RP and I_RP_TA tasks, conditioned on whether F is Win or Lose, shown in Table 5.

7.1 Results

We observe that the relative frequency heuristic (F) is also a strong predictor of false positive rates (hallucinations) in I_RP_TA, with a separation of 1.5x, 1.7x, and 2.0x for LLaMA, GPT-3.5, and PaLM respectively. When samples conform to the heuristic, models are more likely to report an entailment, even though no semantic relation exists between premise and hypothesis. Again, the effect is consistent across model families, revealing its root in the large-scale pre-training process rather than in model peculiarities or fine-tuning.

The I_RP results show that the relative frequency heuristic has a weaker effect when entity-based memories are available. This indicates a tension between V and F: memory may be used when available, and if not, the predicate pairing may be attended to more closely.

8 Impact of Bias on Performance

We have demonstrated two sources of hallucination by LLMs on inference tasks. We now assess their impact on model performance to quantify the risks of such hallucinations.

We compare LLMs' performance between subsets of the Levy/Holt dataset that are consistent with or adversarial to each factor. An entry P ⊨ H? is consistent with a factor when the prediction with the factor is the same as the gold entailment label; conversely, it is adversarial to a factor when the prediction with the factor disagrees with the label. For this we again use the predictions from models' veracity priors (V) and the relative frequency heuristic (F). Subset statistics are in Table 6.

While earlier experiments scored model textual responses to characterize behavior change, we now use area under the precision-recall curve (AUC) to summarize model performance over a tunable confidence threshold (scoring described in §4.2), which is better for measuring practical discriminative power. Following Li et al. (2022), we re-scale AUC values to normalize over the label distribution, yielding AUC_norm values which assign random classifiers 0% and perfect classifiers 100%.
Data Subset | Criteria | # of Entries (LLaMA / GPT-3.5 / PaLM)
V_CONSISTENT | (G = True ∧ V = True) ∨ (G = False ∧ V = False) | 955 / 947 / 999
V_ADVERSARIAL | (G = True ∧ V = False) ∨ (G = False ∧ V = True) | 829 / 837 / 785
F_CONSISTENT | (G = True ∧ F = Win) ∨ (G = False ∧ F = Lose) | 1,134
F_ADVERSARIAL | (G = True ∧ F = Lose) ∨ (G = False ∧ F = Win) | 298

Table 6: Subsets defined by G (entailment label) with either V (hypothesis veracity prediction from each LLM) or F (model-agnostic relative frequency heuristic). CONSISTENT subsets align G with V/F; ADVERSARIAL subsets misalign G with V/F.

Levy/Holt
Model | Task | V_C | V_A | diff. | F_C | F_A | diff.
LLaMA-65B | I | 65.5 | 8.1 | -57.4 | 41.4 | 34.3 | -7.1
GPT-3.5 | I | 85.0 | 10.8 | -74.2 | 52.6 | 41.0 | -11.6
PaLM-540B | I | 79.1 | 31.5 | -47.6 | 62.3 | 51.7 | -10.6
LLaMA-65B | I_TA | 52.1 | 34.4 | -17.7 | 53.1 | 37.7 | -15.4
GPT-3.5 | I_TA | 67.1 | 18.8 | -48.3 | 52.2 | 36.2 | -16.0
PaLM-540B | I_TA | 58.1 | 46.6 | -11.5 | 58.2 | 44.8 | -13.4

Table 7: LLM performance on subsets where V/F is Consistent/Adversarial to gold labels, measured with AUC_norm (0% = random-chance performance). Decreases from V_C/F_C to V_A/F_A subsets are presented in the diff. columns.
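The AUC_norm rescaling can be sketched as below, under the assumption (consistent with the description in §8 and with Li et al., 2022) that a random classifier's precision-recall AUC equals the positive-label rate.

    from sklearn.metrics import precision_recall_curve, auc

    def auc_norm(labels: list[int], scores: list[float]) -> float:
        """Rescale precision-recall AUC so that random = 0.0 and perfect = 1.0.

        Assumes a random classifier's PR-AUC equals the positive-label rate.
        """
        precision, recall, _ = precision_recall_curve(labels, scores)
        raw_auc = auc(recall, precision)
        random_baseline = sum(labels) / len(labels)
        return (raw_auc - random_baseline) / (1.0 - random_baseline)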

8.1 Results

We report results in Table 7. Under the standard inference task I, the performance drop from V_CONSISTENT (V_C) to V_ADVERSARIAL (V_A) is severe for all 3 LLMs: they deteriorate from very good classifiers to poor or even near-random ones. This fragility from the veracity prior can be alleviated by masking entity IDs with type-identifiers (condition I_TA), which reduces the performance drop.

On the other hand, with the type replacements in I_TA, LLMs are forced to focus on the predicates in each proposition. As a result, the impact of the relative frequency heuristic is intensified. From the standard inference task I to I_TA, the performance gap between the F_CONSISTENT (F_C) and F_ADVERSARIAL (F_A) subsets widens for all LLMs. The differences are generally less dramatic in F-consistency subsets than in V-consistency subsets, partly because the relative frequency heuristic involves pairs of predicates and may be more difficult for LLMs to capture, and potentially because frequency measures with lemmatized predicates and Google N-grams are only a crude estimate of the actual frequencies in each LLM's pre-train corpus.

We note that for V-consistency comparisons, the difference in performance between entries with aligned and misaligned V predictions could be influenced by model-specific idiosyncrasies such as patterns in syntax or vocabulary. We polled all three LLMs for V predictions to acquire a majority vote Ṽ on sample veracity. Conditioning on Ṽ, we find consistency with Table 7 and sometimes larger effects, confirming that noise in V predictions has marginal impact on the comparisons. We show details in Appendix Table 11.

Between pre-trained and instruction-tuned models, the trends are the same. This suggests that current LLM fine-tuning approaches either fail to correct these behaviors or have overlooked them. We call for attention to these behaviors in order to further narrow the residual performance gaps and improve model robustness when reasoning about language.

9 Conclusion

Across several major LLM families and experimental settings, we demonstrate two important factors in the performance of LLMs on natural language inference tasks, which may also manifest in applied tasks as hallucination. Contrary to claims of LLM general reasoning capabilities, we show that much of this performance is achieved by (1) recall of relevant memorizations and (2) corpus-based heuristics like term frequency. Since these factors are reproduced in all model families, we establish that they originate in model pretraining.

We conclude that LLMs, though powerful, use unsatisfactory tools for the basic faculties of language understanding and inference. We propose preliminary approaches to help alleviate these problems, but also argue that LLMs must make more progress before they can be relied on to reason in ways analogous to human beings.
Limitations

In this paper, we have discussed two prominent sources of hallucination for LLMs when performing natural language inference. We acknowledge that this is not an exhaustive search of all the sources; further explorations could be done as future work.

We also note that after ablating the factors discussed in this paper, there remains residual, unexplained performance on NLI tasks. This residual could be attributed to undiscovered biases or to generalising inference capability. We leave the analysis of this residual to future work.

As discussed in Appendix A, we compared a range of popular LLM prompting techniques and selected the most promising approach. We also acknowledge that there could potentially be other novel prompting techniques that could help LLMs resist the influence of the priors discussed in this paper. We identify this as an open question and advocate for future research.

Ethics Statement

This paper discusses two major sources of hallucination in LLM output when asked to perform natural language inference, which we note is a capability required of many downstream tasks such as summarization, question answering, etc. We show that users of LLMs may be subjected to faulty judgements if the content of their request overlaps with data in pretraining. However, it is difficult for both a user and a modeler to ascertain exactly what is contained in pretraining data, or how this will interact with a user's query. Our proposed veracity prior shows promise in detecting potential overlaps, but model responses in applications of these cases are not explored. Further, the relative frequency prior demonstrates a much more subtle problem of corpus distribution that is naturally inherent to model pretraining.

In light of these, the potential harms of LLM use for drawing natural language inferences may include offering inaccurate or irrelevant information in response to a user's query, or contradicting information provided in-context with a user's query.

References

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, page 1247–1250, New York, NY, USA. Association for Computing Machinery.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. ArXiv:2303.12712 [cs].

Sharon A. Caraballo and Eugene Charniak. 1999. Determining the specificity of nouns from text. In 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying Memorization Across Neural Language Models. ArXiv:2202.07646 [cs].

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi,
David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).

Bernal Jimenez Gutierrez, Nikolas McNeal, Clayton Washington, You Chen, Lang Li, Huan Sun, and Yu Su. 2022. Thinking about GPT-3 in-context learning for biomedical IE? Think again. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4497–4512, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association of Computational Linguistics.

Tianyi Li, Mohammad Javad Hosseini, Sabine Weber, and Mark Steedman. 2022. Language Models Are Poor Learners of Directional Inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 903–921, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI'12, page 94–100. AAAI Press.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv:1907.11692 [cs].

Nick McKenna, Liane Guillou, Mohammad Javad Hosseini, Sander Bijl de Vroe, Mark Johnson, and Mark Steedman. 2021. Multivalent entailment graphs for question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10758–10768, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Nick McKenna and Mark Steedman. 2022. Smoothing entailment graphs with language models. ArXiv:2208.00318v1 [cs.CL].

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing.

D.B. Nguyen, Johannes Hoffart, M. Theobald, and G. Weikum. 2014. AIDA-light: High-throughput named-entity disambiguation. Volume 1184.

OpenAI. 2023. GPT-4 Technical Report. ArXiv:2303.08774 [cs].

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv:2203.02155 [cs].

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.

Martin Schmitt and Hinrich Schütze. 2021. Language Models for Lexical Inference in Context. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1267–1280, Online. Association for Computational Linguistics.

Krishna Srinivasan, Karthik Raman, Anupam Samanta, Lingrui Liao, Luca Bertelli, and Michael Bendersky. 2022. QUILL: Query intent with large language models using retrieval augmentation and multi-stage distillation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 492–501, Abu Dhabi, UAE. Association for Computational Linguistics.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. 2022. Memorization without overfitting: Analyzing the training dynamics of large language models. In Advances in Neural Information Processing Systems, volume 35, pages 38274–38290. Curran Associates, Inc.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.
Albert Webson and Ellie Pavlick. 2022. Do prompt-based models really understand the meaning of their prompts? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2300–2344, Seattle, United States. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. ArXiv:2201.11903 [cs].

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2023. Benchmarking large language models for news summarization.

A Prompt Format Selection

In prompt-based interactions with the LLMs, several types of context information could be added to help models produce accurate and robust predictions. We attend to two design choices in prompt engineering: prompt templates and in-context examples.

Prompt templates are known to have a direct and sometimes decisive impact on LLM behavior. As such, we carefully select a range of clear and concise templates as promising candidates. As discussed in §4.2, we run each template through the dev sets of each dataset, and select the template with the best discriminative power according to AUC scores (similarly to §8). The candidate set of templates includes 3 concise templates we wrote:

1. If [PREMISE], then [HYPOTHESIS].

2. [PREMISE], so [HYPOTHESIS].

3. [PREMISE] entails [HYPOTHESIS].

We also considered the 5 prompt templates used in prior work on LMs for textual entailment (Schmitt and Schütze, 2021):

4. [PREMISE], which means that [HYPOTHESIS].

5. [HYPOTHESIS], because [PREMISE].

6. It is not the case that [HYPOTHESIS], let alone that [PREMISE].

7. [HYPOTHESIS]_NEG, which means that [PREMISE]_NEG.

8. [PREMISE]_NEG, because [HYPOTHESIS]_NEG.

In preliminary experiments with GPT-3.5, we observed that LLMs are not responsive to the 3 contrapositive prompts from Schmitt and Schütze (2021) (templates 6-8), performing at random. We also observed that prompt number 5 from Schmitt and Schütze (2021) consistently underperforms the other 4 templates, so we use the remaining 4 templates (namely, templates no. 1, 2, 3, 4) as our final candidate set.

In-context Examples have been widely used for interactions with LLMs since the seminal work of Brown et al. (2020). Further, Wei et al. (2022) demonstrated that including chains of thought, namely step-by-step explanations, in the in-context examples helps LLMs perform reasoning tasks. On the other hand, Ouyang et al. (2022) suggested that instruction-tuned LLMs are also capable of performing tasks zero-shot, without exposure to any in-context examples.

We compared zero-shot and few-shot prompting in our preliminary experiments with LLaMA and GPT-3.5 on the Levy/Holt directional dev set. Following Touvron et al. (2023), for zero-shot, we prepend a textual description of the task to each test sample; for few-shot, we prepend a minimal 4 examples with explanations. Instantiated prompts in the two settings are demonstrated in Table 8. Here we report the dev set results with the best-performing templates.

We found that for LLaMA, the model's zero-shot performance on the Levy/Holt directional dev set is near-random, at 56.6% AUC (random is 50%); with 4 in-context examples, the model begins to exhibit non-trivial behavior, with 65.0% AUC. This is not surprising, since LLaMA is only a pre-trained LLM without any instruction fine-tuning. For GPT-3.5, the performance is still much lower in zero-shot, at 64.5%, compared to 74.6% in few-shot.

As discussed in §4.2, ideally we would like LLMs to have zero-shot natural language abilities readily available for downstream tasks. However, in light of this observation, our primary experiments are conducted in the few-shot setting throughout, in order to reveal the abilities of these LLMs to the fullest.
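For illustration, a test pair could be instantiated with template 1 and the three-way answer choice of §4.2 roughly as sketched below; the wording and example handling here are illustrative, not the exact annotated prompts used in the experiments.

    TEMPLATE = "If {premise}, then {hypothesis}."
    CHOICES = "A) Entailment  B) Neutral  C) Contradiction"

    def build_prompt(premise: str, hypothesis: str,
                     few_shot_examples: list[str]) -> str:
        """Prepend annotated few-shot examples, then pose the query sample."""
        query = (f"{TEMPLATE.format(premise=premise, hypothesis=hypothesis)}\n"
                 f"{CHOICES}\nAnswer:")
        return "\n\n".join(few_shot_examples + [query])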
neering techniques to steer its behavior away from in general. Therefore, with the majority vote, we
attending to veracity. mask these noises and acquire sound predictions
Towards this goal, we experimented with pre- on the grounded veracity of statements.
pending a line of task description to the few-shot Performances of LLMs between Ṽ -consistency
prompts in part B of Table 8, explicitly instruct- subsets are listed in Table 11. Gaps between
ing the models to ignore the veracity of individual the Ṽ -consistency subsets that are larger than V -
statements: Please check the entailments between consistency gaps are colored red; those narrower
the following hypothetical statements. Ignore the than V -consistency gaps are colored green. It is
veracity of these statements.. clear that the gaps are consistent between V /Ṽ -
We replicated the experiments in §5 and §6 with consistency experiments, where the gaps are even
GPT-3.5, but the results show only marginal im- larger on many occasions. This confirms, that the
provements in model behavior. performance gaps in V -consistency experiments
In Table 9, we show that instructing GPT-3.5 can be credited to the veracity prior, rather than
to ignore veracity does not help narrow the gap model-specific idiosyncrasies.
between V = T rue and V = F alse; instead, It is also to be noted that, since the F -consistency
ratios of positive predictions went down by similar subsets are separated based on the model-agnostic
amounts, indicating that the model is becoming criterion F , model-specific idiosyncrasies are not a
slightly more conservative in predicting positives problem for F -consistency comparisons.
when instructed to ignore veracity, but not in a
principled manner. D Impacts of Bias on GPT-4 Performance
Further, as shown in Table 10, despite the
GPT-4 (OpenAI, 2023) is a recent strong LLM
explicit instruction, recall still drops at similar
claiming SOTA performance on various NLP tasks.
scales when arguments are randomly replaced
Due to its closed-source nature and the impossibil-
with the same sets of frequent/infrequent replace-
ity of fully tracking the sources of its behaviors, we
ment entities as before. Since GPT-3.5 is an
refrain from reporting results with it in the main
instruct-finetuned model trained to be responsive
content of this paper.
to prompts, its failure means eradicating such bi-
However, in order to provide a richer context
ases from model outputs is a difficult task, one that
for the veracity prior and the Relative Frequency
needs further research attention.
Heuristic, in this section we report the perfor-
C The Reliability of V Measure and Its mance differences of GPT-4 between subsets con-
Relation to Grounded Veracity sistent/adversarial to the two factors.
As a light-weight experiment, we elicit GPT-4
The V -consistency subsets most directly capture predictions in the original I task in the zero-shot
the impacts of the veracity prior. However, as dis- setting, and re-use subsets from experiments in
cussed in §8.1, these subset separations are based §8. Specifically, for the veracity prior, we use
on V predictions from individual models, which the majority vote Ṽ among LLaMA, GPT-3.5 and
can be noisy, subject to model-specific noise such PaLM, to approximate V predictions from GPT-4
as trigger strings or responses to certain syntax itself; for the relative frequency heuristic, we keep
structures in the hypotheses, etc. the F measure for approximating corpus-frequency
To verify that the performance gaps in V - of terms.
consistency subsets that we observe in §8.1 comes Because GPT-4 is a commercial service and does
from predicted veracity and not any of the noise not provide logit confidence with their discrete pre-
sources, we experiment with another pair of subsets dictions, AU Cnorm values could not be calculated.
based on grounded veracity instead of predicted ve- Therefore, we are forced to report the F-1 scores
racity. at the binary prediction point of confidence. As
We use a majority vote among the three results in Table 12 show, we observe the same trend
independently-trained LLMs to approximate as in §8: for the subset adversarial to each factor,
grounded veracity, the approximation is denoted GPT-4 performance also drops substantially.
as Ṽ . This is because, any model-specific idiosyn- This experiment is designed to provide more
crasies should not be shared between LLMs in- context for the two factors discussed in the paper
dependently trained from different source corpora and NOT to compare GPT-4 with other models;
however, we can conclude that GPT-4 is subject to
the same fragilities as the other LLMs w.r.t. the
two factors, where our conclusions and recommen-
dations also apply.
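To make the template-selection procedure described at the start of these appendices concrete, the following is a minimal sketch, not the authors' released code: each candidate template is instantiated over the dev set, scored with a hypothetical query_llm confidence function, and the template with the best AUC is kept. roc_auc_score is used here only as a placeholder for the AUC criterion of §4.2.

```python
# Minimal sketch of template selection by dev-set AUC.
# `query_llm` is a hypothetical stand-in for scoring a prompt with an LLM;
# the real experiments use the models and scoring described in the paper.
from sklearn.metrics import roc_auc_score

TEMPLATES = [
    "If {prem}, then {hyp}.",
    "{prem}, so {hyp}.",
    "{prem} entails {hyp}.",
    "{prem}, which means that {hyp}.",
    "{hyp}, because {prem}.",
]

def query_llm(prompt: str) -> float:
    """Hypothetical: return the model's confidence that the answer is 'Entailment'."""
    raise NotImplementedError

def select_template(dev_set, templates=TEMPLATES):
    """Score every candidate template on the dev set; keep the one with the best AUC."""
    best_template, best_auc = None, float("-inf")
    for template in templates:
        scores = [query_llm(template.format(prem=p, hyp=h)) for p, h, _ in dev_set]
        labels = [label for _, _, label in dev_set]
        auc = roc_auc_score(labels, scores)  # placeholder for the paper's AUC criterion
        if auc > best_auc:
            best_template, best_auc = template, auc
    return best_template, best_auc
```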
A. Zero-shot Example Instantiated Prompt
Please check the entailments between the following statements.
If kanamycin kills infections, then kanamycin is useful in infections.
A) Entailment
B) Neutral
C) Contradiction
B. Few-shot Example Instantiated Prompt
If Google bought Youtube, then Google owns Youtube.
A) Entailment
B) Neutral
C) Contradiction
Answer: A) Entailment. Owning is a consequence of buying.
If Google owns Youtube, then Google bought Youtube.
A) Entailment
B) Neutral
C) Contradiction
Answer: B) Neutral. Owning does not imply buying, the ownership may come from other means.
If John went to the mall, then John drove to the mall.
A) Entailment
B) Neutral
C) Contradiction
Answer: B) Neutral. John may have gone to the mall by other means.
If John drove to the mall, then John went to the mall.
A) Entailment
B) Neutral
C) Contradiction
Answer: A) Entailment. Driving is a means of going to the mall.
If ephedrine is widely used in medicine, then ephedrine is used in medicine.
A) Entailment
B) Neutral
C) Contradiction
Answer:
C. Hypothesis-only Example Instantiated Prompt
Google bought Youtube.
A) True
B) Unknown
C) False
Answer: A) True.
Yoshua Bengio likes oak trees.
A) True
B) Unknown
C) False
Answer: B) Unknown.
The sun rises from the west.
A) True
B) Unknown
C) False
Answer: C) False.
ephedrine is used in medicine.
A) True
B) Unknown
C) False
Answer:
Table 8: Example instantiated prompts in Zero-shot / Few-shot settings, for the test entry “PREMISE: [ephedrine
is widely used in medicine], HYPOTHESIS: [ephedrine is used in medicine]”. The few-shot prompts in part B are
used throughout the main experiments in this paper. We also present an example of the prompts we use for the
hypothesis-only V measure as described in §3.1.
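As an illustration of how the prompts in Table 8 are put together, here is a minimal sketch. The helper names (build_few_shot_prompt, FEW_SHOT_EXAMPLES) are ours, only the first in-context example is reproduced, and the optional instruction line is the Appendix B variant.

```python
# Minimal sketch of assembling the few-shot prompt in part B of Table 8 for a
# test pair, with the optional "ignore veracity" instruction from Appendix B.
FEW_SHOT_EXAMPLES = """\
If Google bought Youtube, then Google owns Youtube.
A) Entailment
B) Neutral
C) Contradiction
Answer: A) Entailment. Owning is a consequence of buying.
"""  # ... the remaining three in-context examples of Table 8 are omitted here

OPTIONS = "A) Entailment\nB) Neutral\nC) Contradiction\nAnswer:"

def build_few_shot_prompt(premise: str, hypothesis: str, ignore_veracity: bool = False) -> str:
    # Appendix B prepends this instruction line to the regular few-shot prompt.
    instruction = ""
    if ignore_veracity:
        instruction = ("Please check the entailments between the following hypothetical "
                       "statements. Ignore the veracity of these statements.\n")
    query = f"If {premise}, then {hypothesis}.\n{OPTIONS}"
    return instruction + FEW_SHOT_EXAMPLES + query

print(build_few_shot_prompt("ephedrine is widely used in medicine",
                            "ephedrine is used in medicine"))
```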
GPT-3.5                          Instructed to Ignore Veracity    Not Instructed
P(I = Entail | V = True)         74.3                             77.6
P(I = Entail | V ≠ True)         57.8                             63.6
P(I_RP = Entail | V = True)      39.0                             41.3
P(I_RP = Entail | V ≠ True)      17.6                             18.8

Table 9: We estimate the probability of positive predictions in the I and I_RP tasks respectively, conditioned on the predicted veracity of the hypothesis (V = True vs. V ≠ True). The Not Instructed results are borrowed from Table 2 and listed here for ease of comparison; also note that all I_RP = Entail predictions are false positives.
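The conditional rates in Table 9 can be computed from paired per-example predictions, as in the sketch below. The record format and field names are illustrative assumptions, not the paper's data format.

```python
# Sketch: estimate P(I = Entail | V = True) and P(I = Entail | V != True) from
# paired predictions on the entailment task (I) and the hypothesis-only task (V).
from typing import List, Dict

def positive_rate_given_veracity(records: List[Dict], veracious: bool) -> float:
    """Percentage of Entail predictions among examples with the given predicted veracity."""
    selected = [r for r in records if (r["V"] == "True") == veracious]
    if not selected:
        return float("nan")
    positives = sum(r["I"] == "Entail" for r in selected)
    return 100.0 * positives / len(selected)

# Toy predictions, for illustration only:
toy = [
    {"I": "Entail", "V": "True"},
    {"I": "Neutral", "V": "True"},
    {"I": "Entail", "V": "False"},
]
print(positive_rate_given_veracity(toy, veracious=True))   # 50.0
print(positive_rate_given_veracity(toy, veracious=False))  # 100.0
```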
Levy/Holt (Directional)
GPT-3.5 Condition                          Task     Precision    Recall    Δ-Recall
Few-shot, instructed to ignore veracity    I        64.9         90.8      0
                                           I_RA↓    64.6         68.4      -22.4
                                           I_RA↑    67.5         58.1      -32.7
Few-shot, no instructions                  I        62.4         92.3      0
                                           I_RA↓    65.5         66.5      -25.8
                                           I_RA↑    68.8         55.3      -37.0

Table 10: GPT-3.5 predictions when the model is explicitly instructed not to take the veracity of individual statements into account. The upper half shows the instructed behavior; the lower half shows the regular few-shot behavior as in Table 4. Differences in recall remain at a similar scale, with precision again stable, so the benefit from the explicit instruction is marginal.
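For reference, the Δ-Recall column in Table 10 is the change in recall on an argument-replaced task relative to the original I task. A small sketch with illustrative inputs:

```python
# Sketch of the Delta-Recall computation: recall on an argument-replaced task
# (I_RA up/down) minus recall on the original I task, over binary labels.
def recall(gold, pred) -> float:
    tp = sum(g and p for g, p in zip(gold, pred))
    positives = sum(gold)
    return 100.0 * tp / positives if positives else float("nan")

def delta_recall(gold, pred_original, pred_replaced) -> float:
    """Recall on the argument-replaced task minus recall on the original task."""
    return recall(gold, pred_replaced) - recall(gold, pred_original)

gold = [1, 1, 1, 0]
pred_i = [1, 1, 1, 1]       # original task: recall 100.0
pred_ra = [1, 0, 0, 1]      # after replacement: recall 33.3
print(round(delta_recall(gold, pred_i, pred_ra), 1))  # -66.7
```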
Levy/Holt
Model        Task    Ṽ_C     Ṽ_A     diff.
LLaMA-65B    I       65.3    6.5     -58.8
GPT-3.5      I       70.8    23.5    -47.3
PaLM-540B    I       80.7    28.3    -52.4
LLaMA-65B    I_TA    54.4    29.6    -24.8
GPT-3.5      I_TA    56.2    35.5    -20.7
PaLM-540B    I_TA    59.3    40.1    -19.2

Table 11: LLM performance on Levy/Holt subsets where the approximated veracity Ṽ is consistent (Ṽ_C) or adversarial (Ṽ_A) to the labels, measured with AUC_norm (0% = random-chance performance). Performance drops from Ṽ_C to Ṽ_A are presented in the diff. column; decreases sharper than the V-comparisons in Table 7 are colored red, milder ones are colored green.
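A minimal sketch of the Ṽ construction behind Table 11, under the assumption that each example carries the three models' hypothesis-only V predictions and a gold entailment label (the data structures and field names are ours):

```python
# Sketch: approximate grounded veracity by a majority vote over the three LLMs'
# hypothesis-only V predictions, then split examples into consistency subsets.
from collections import Counter

def majority_vote(votes):
    """Return the most common veracity label among the models' V predictions."""
    return Counter(votes).most_common(1)[0][0]

def split_by_consistency(examples):
    """Our reading of the subsets: consistent if the vote agrees with the gold label."""
    consistent, adversarial = [], []
    for ex in examples:
        v_tilde = majority_vote(ex["v_predictions"])  # e.g. from LLaMA, GPT-3.5, PaLM
        agrees = (v_tilde == "True") == (ex["label"] == "Entail")
        (consistent if agrees else adversarial).append(ex)
    return consistent, adversarial

example = {"v_predictions": ["True", "True", "False"], "label": "Entail"}
cons, adv = split_by_consistency([example])
print(len(cons), len(adv))  # 1 0
```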
Levy/Holt
F-1 score          Task    Ṽ_C             Ṽ_A
random baseline    -       70.3            62.0
GPT-4              I       85.1 (+14.8)    67.6 (+5.6)
                           F_C             F_A
random baseline    -       66.7            66.7
GPT-4              I       74.6 (+7.9)     69.7 (+3.0)

Table 12: GPT-4 performance on Levy/Holt subsets that are consistent or adversarial to the approximated veracity Ṽ (Ṽ_C / Ṽ_A, upper half) and to the relative frequency measure F (F_C / F_A, lower half), measured with F-1 score. The random baseline is the highest F-1 score achievable by a random classifier, obtained at random precision and 100% recall. For each GPT-4 score, we also show the improvement over random (in parentheses).
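The random baseline in Table 12 follows from the caption's description: at 100% recall, precision equals the subset's positive rate p, so the best F-1 a random classifier can reach is 2p/(p + 1). The positive rates used below are back-calculated from the reported numbers and are therefore only approximate.

```python
# Sketch of the random-baseline F-1 in Table 12: a random/uninformed classifier
# that accepts everything reaches 100% recall with precision equal to the
# positive rate p, so its F-1 is 2p / (p + 1).
def baseline_f1(positive_rate: float) -> float:
    precision, recall = positive_rate, 1.0
    return 100.0 * 2 * precision * recall / (precision + recall)

print(round(baseline_f1(0.5), 1))    # 66.7  -> matches the F_C / F_A baseline
print(round(baseline_f1(0.542), 1))  # 70.3  -> consistent with the Ṽ_C baseline (p back-calculated)
```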