
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard,
Colton Bishop, Victor Carbune, Abhinav Rastogi
Google Research
{harrisonlee,samratph,hassan}@google.com

Abstract

Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high-quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF vs. RL from AI Feedback (RLAIF) - a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans - and we find that they result in similar improvements. On the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline supervised fine-tuned model in ∼70% of cases. Furthermore, when asked to rate RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results suggest that RLAIF can yield human-level performance, offering a potential solution to the scalability limitations of RLHF.

Figure 1: Human evaluators strongly prefer RLHF and RLAIF summaries over the supervised fine-tuned (SFT) baseline. The differences in win rates between RLAIF vs. SFT and RLHF vs. SFT are not statistically significant. Additionally, when compared head-to-head, RLAIF is equally preferred to RLHF by human evaluators. Error bars denote 95% confidence intervals.

1 Introduction

Reinforcement Learning from Human Feedback (RLHF) is an effective technique for aligning language models to human preferences (Stiennon et al., 2020; Ouyang et al., 2022) and is cited as one of the key drivers of success in modern conversational language models like ChatGPT and Bard (Liu et al., 2023; Manyika, 2023). By training with reinforcement learning (RL), language models can be optimized on complex, sequence-level objectives that are not easily differentiable with traditional supervised fine-tuning.

The need for high-quality human labels is an obstacle to scaling up RLHF, and one natural question is whether artificially generated labels can achieve comparable results. Several works have shown that large language models (LLMs) exhibit a high degree of alignment with human judgment - even outperforming humans on some tasks (Gilardi et al., 2023; Ding et al., 2023). Bai et al. (2022b) was the first to explore using AI preferences to train a reward model used for RL fine-tuning - a technique called "Reinforcement Learning from AI Feedback" (RLAIF)[1]. While they showed that utilizing a hybrid of human and AI preferences in conjunction with the "Constitutional AI" self-revision technique outperforms a supervised fine-tuned baseline, their work did not directly compare the efficacy of human vs. AI feedback, leaving unanswered the question of whether RLAIF can be a suitable alternative to RLHF.

In this work, we directly compare RLAIF against RLHF on the task of summarization. Given a text and two candidate responses, we assign a preference label using an off-the-shelf LLM. We then train a reward model (RM) on the LLM preferences with a contrastive loss. Finally, we fine-tune a policy model with reinforcement learning, using the RM to provide rewards.

[1] We use "RLAIF" to denote training a reward model on AI-labeled preferences followed by conducting RL fine-tuning. This is distinct from "Constitutional AI", which improves upon a supervised learning model by iteratively asking an LLM to generate better responses according to a constitution. Both were introduced in Bai et al. (2022b) and are sometimes confused for one another.
Figure 2: A diagram depicting RLAIF (top) vs. RLHF (bottom)

Our results show that RLAIF achieves comparable performance to RLHF, measured in two ways. First, we observe that both RLAIF and RLHF policies are preferred by humans over a supervised fine-tuned (SFT) baseline 71% and 73% of the time, respectively, and the two win rates are not statistically significantly different. Second, when asked to directly compare generations from RLAIF vs. RLHF, humans prefer both at equal rates (i.e. a 50% win rate). These results suggest that RLAIF is a viable alternative to RLHF that does not depend on human annotation and offers appealing scaling properties.

Additionally, we study techniques to maximize the alignment of AI-generated preferences with human preferences. We find that prompting our LLM with detailed instructions and soliciting chain-of-thought reasoning improve alignment. Surprisingly, we observe that both few-shot in-context learning and self-consistency - a process in which we sample multiple chain-of-thought rationales and average the final preferences - do not improve accuracy and may even degrade it. Finally, we conduct scaling experiments to quantify the trade-offs between the size of the LLM labeler and the number of preference examples used in training vs. alignment with human preferences.

Our main contributions are the following:

• We demonstrate that RLAIF achieves comparable performance to RLHF on the task of summarization

• We compare various techniques for generating AI labels and identify optimal settings for RLAIF practitioners

2 Preliminaries

We first review the RLHF pipeline introduced in Stiennon et al. (2020); Ouyang et al. (2022), which consists of 3 phases: supervised fine-tuning, reward model training, and reinforcement learning-based fine-tuning.

2.1 Supervised Fine-tuning

A pre-trained LLM is fine-tuned on a high-quality labeled dataset for a downstream task using token-level supervision to produce a supervised fine-tuned (SFT) model π^SFT.

2.2 Reward Modeling

Given an input x, we sample a pair of responses (y_1, y_2) ∼ π from one or more models, where oftentimes the SFT model is used. The input and responses are sent to human annotators, who rate which response is better according to some criteria. These annotations form a dataset of triplets D = {(x, y_w, y_l)}, where y_w and y_l are the preferred and non-preferred responses, respectively. A reward model r_ϕ is trained by minimizing the following loss:

L_r(ϕ) = −E_{(x, y_w, y_l) ∼ D} [ log σ( r_ϕ(x, y_w) − r_ϕ(x, y_l) ) ]

where σ is the sigmoid function.
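As a concrete illustration, the pairwise loss above can be written in a few lines of PyTorch. This is a minimal sketch rather than the authors' implementation; it assumes the reward scores for the preferred and non-preferred responses have already been computed by the RM for a batch of triplets.

    import torch
    import torch.nn.functional as F

    def pairwise_rm_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
        # r_w: RM scores for the preferred responses y_w, shape (batch,)
        # r_l: RM scores for the non-preferred responses y_l, shape (batch,)
        # Implements -E[log sigma(r(x, y_w) - r(x, y_l))] averaged over the batch.
        return -F.logsigmoid(r_w - r_l).mean()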
Preamble A good summary is a shorter piece of text that has the
essence of the original. ... Given a piece of text and two
of its possible summaries, output 1 or 2 to indicate which
summary best adheres to coherence, accuracy, coverage, and
overall quality as defined above.

1-Shot Exemplar »»»» Example »»»»

Text - We were best friends over 4 years ...


Summary 1 - Broke up with best friend, should I wish her a
happy birthday... And what do you think of no contact?
Summary 2 - should I wish my ex happy birthday, I broke no
contact, I’m trying to be more patient, I’m too needy, and I
don’t want her to think I’ll keep being that guy.

Preferred Summary=1

»»»» Follow the instructions and the example(s) above »»»»

Sample to Annotate Text - {text}


Summary 1 - {summary1}
Summary 2 - {summary2}

Ending Preferred Summary=

Table 1: An example of a prompt fed to an off-the-shelf LLM to generate AI preference labels. "{text}", "{sum-
mary1}", and "{summary2}" are populated with unlabeled examples, and a preference distribution is obtained by
computing the softmax of the log probabilities of generating the tokens "1" vs. "2".

2.3 Reinforcement Learning

A policy π_θ^RL is initialized from the SFT model weights and then optimized with reinforcement learning to maximize the reward given by the RM, which serves as a proxy for human preferences. Optionally, a Kullback-Leibler (KL) divergence loss D_KL is added to the objective to penalize π_θ^RL for deviating from the original SFT policy π^SFT, controlled by the hyperparameter β - a technique similar to natural policy gradients (Kakade, 2001). The KL loss helps prevent π_θ^RL from drifting into a region where it generates language that is highly rewarded by the RM yet consists of low-quality or unnatural language - a phenomenon known as "reward hacking" (Everitt and Hutter, 2016; Amodei et al., 2016). The full optimization objective is described by the equation below:

max_θ E[ r_ϕ(y|x) − β D_KL( π_θ^RL(y|x) ‖ π^SFT(y|x) ) ]
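A common way to realize this objective in practice is to fold the KL penalty into the scalar reward that the RL algorithm maximizes. The sketch below is a minimal illustration under that assumption rather than necessarily the authors' exact formulation; the log probabilities are assumed to be summed over the tokens of the sampled response y, and the default β matches the value reported in Appendix C.

    def kl_shaped_reward(rm_score: float,
                         logprob_rl: float,
                         logprob_sft: float,
                         beta: float = 0.05) -> float:
        # rm_score:    r_phi(y|x) from the trained reward model
        # logprob_rl:  log pi_RL(y|x) under the current policy
        # logprob_sft: log pi_SFT(y|x) under the frozen SFT policy
        # (logprob_rl - logprob_sft) is a single-sample estimate of the KL term.
        return rm_score - beta * (logprob_rl - logprob_sft)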
3 RLAIF Methodology

In this section, we describe the techniques used to generate preference labels with an LLM, how we conduct RL, and our evaluation metrics.

3.1 Preference Labeling with LLMs

We annotate preferences among pairs of candidates with an "off-the-shelf" LLM, which is a model pre-trained or instruction-tuned for general usage but not fine-tuned for a specific downstream task. Given a piece of text and two candidate summaries, the LLM is asked to rate which summary is better. The input to the LLM is structured as follows (example in Table 1):

1. Preamble - Introduction and instructions describing the task at hand

2. Few-shot exemplars (optional) - An example of a text, a pair of summaries, a chain-of-thought rationale (if applicable), and a preference judgment

3. Sample to annotate - A text and a pair of summaries to be labeled

4. Ending - An ending string to prompt the LLM (e.g. "Preferred Summary=")

After the LLM is given the input, we obtain the log probabilities of generating the tokens "1" and "2" and compute the softmax to derive a preference distribution.
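To make this scoring step concrete, the following sketch converts the two token log probabilities into a preference distribution. It is an illustrative example, not the authors' code; logprob_1 and logprob_2 are assumed to be the log probabilities the labeler LLM assigns to generating the tokens "1" and "2" after the prompt.

    import math

    def preference_distribution(logprob_1: float, logprob_2: float) -> list[float]:
        # Softmax over the log probabilities of the tokens "1" and "2".
        m = max(logprob_1, logprob_2)            # subtract the max for numerical stability
        e1 = math.exp(logprob_1 - m)
        e2 = math.exp(logprob_2 - m)
        total = e1 + e2
        return [e1 / total, e2 / total]          # e.g. [0.6, 0.4]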
There are numerous alternatives for obtaining preference labels from LLMs, such as decoding a free-form response from the model and extracting the preference heuristically (e.g. output = "The first summary is better"), or representing the preference distribution as a one-hot representation. However, we did not experiment with these alternatives because our approach already yielded high accuracy.

We experiment with two types of preambles: "Base", which briefly asks "which summary is better?", and "OpenAI", which closely mimics the rating instructions given to the human preference annotators that generated the OpenAI TL;DR preference dataset and contains detailed information about what constitutes a strong summary (Stiennon et al., 2020). See Appendix Table 6 for the full preambles. We also experiment with in-context learning by adding few-shot exemplars to the prompt, where exemplars were manually selected to cover different topics.

3.1.1 Addressing Position Bias

The order in which candidates are shown to the LLM can bias which candidate it prefers (Pezeshkpour and Hruschka, 2023). We find evidence that such a position bias exists, especially with smaller sizes of LLM labelers (see Appendix A).

To mitigate position bias in preference labeling, we make two inferences for every pair of candidates, where the order in which candidates are presented to the LLM is reversed for the second inference. The results from both inferences are then averaged to obtain the final preference distribution.
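The de-biasing step can be illustrated as follows. This builds on the hypothetical preference_distribution helper from the earlier sketch, and score_prompt is an assumed callable that builds a Table 1 style prompt for the given candidate ordering and returns the log probabilities of the tokens "1" and "2".

    from typing import Callable, List, Tuple

    def debiased_preference(text: str, summary_a: str, summary_b: str,
                            score_prompt: Callable[[str, str, str], Tuple[float, float]]
                            ) -> List[float]:
        # First inference: candidate A is shown as "Summary 1", B as "Summary 2".
        lp1, lp2 = score_prompt(text, summary_a, summary_b)
        dist_ab = preference_distribution(lp1, lp2)      # [P(A preferred), P(B preferred)]

        # Second inference: the order of the candidates is reversed.
        lp1, lp2 = score_prompt(text, summary_b, summary_a)
        dist_ba = preference_distribution(lp1, lp2)      # [P(B preferred), P(A preferred)]

        # Average the two inferences, re-aligning them to (A, B) order.
        p_a = (dist_ab[0] + dist_ba[1]) / 2
        p_b = (dist_ab[1] + dist_ba[0]) / 2
        return [p_a, p_b]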
3.1.2 Chain-of-thought Reasoning

We experiment with eliciting chain-of-thought (COT) reasoning from our AI labelers to improve alignment with human preferences (Wei et al., 2022). We replace the Ending of the standard prompt (i.e. "Preferred Summary=") with "Consider the coherence, accuracy, coverage, and overall quality of each summary and explain which one is better. Rationale:" and then decode a response from the LLM. Finally, we concatenate the original prompt, the response, and the original Ending string "Preferred Summary=" together, and follow the scoring procedure in Section 3.1 to obtain a preference distribution. See Figure 3 for an illustration.

In zero-shot prompts, the LLM is not given an example of what reasoning should look like, while in few-shot prompts, we provide examples of COT reasoning for the model to follow. See Tables 7 and 8 for examples.
3.1.3 Self-Consistency

For chain-of-thought prompts, we also experiment with self-consistency - a technique to improve upon chain-of-thought reasoning by sampling multiple reasoning paths and aggregating the final answer produced at the end of each path (Wang et al., 2022). Multiple chain-of-thought rationales are sampled with a non-zero decoding temperature, and LLM preference distributions are then obtained for each one, following the approach in Section 3.1.2. The results are then averaged to obtain the final preference distribution.
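Aggregation across sampled rationales is then a simple average of the per-sample distributions. A minimal sketch, assuming the hypothetical cot_preference helper from the previous sketch is called with a non-zero decoding temperature for each sample:

    def self_consistent_preference(prompt_without_ending: str,
                                   generate_text, score_tokens,
                                   num_samples: int = 4) -> list[float]:
        # Sample several chain-of-thought rationales and average the resulting
        # preference distributions (Section 3.1.3).
        dists = [cot_preference(prompt_without_ending, generate_text, score_tokens)
                 for _ in range(num_samples)]
        p1 = sum(d[0] for d in dists) / num_samples
        p2 = sum(d[1] for d in dists) / num_samples
        return [p1, p2]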
Figure 3: To derive an AI preference label, the LLM is first prompted to verbally explain its thoughts on the quality
of the two candidates (blue). The LLM response is then appended to the original prompt (orange) and fed to the
LLM a second time to generate a preference distribution over "1" vs. "2" based on their log probabilities (green).

3.2 Reinforcement Learning from AI Feedback

After preferences are labeled by the LLM, a reward model (RM) is trained to predict preferences. Since our approach produces soft labels (e.g. preferences_i = [0.6, 0.4]), we apply a cross-entropy loss to the softmax of the reward scores generated by the RM instead of the loss mentioned in Section 2.2. The softmax converts the unbounded scores from the RM into a probability distribution.

Training a RM on a dataset of AI labels can be viewed as a form of model distillation, especially since our AI labeler is often larger and more powerful than our RM. An alternative approach is to bypass the RM and use AI feedback directly as a reward signal in RL, though this approach is more computationally expensive since the AI labeler is larger than the RM.

With the trained RM, we conduct reinforcement learning with a modified version of the Advantage Actor Critic (A2C) algorithm adapted to the language modeling domain (Mnih et al., 2016) (details in Appendix B). While many recent works use Proximal Policy Optimization (PPO) (Schulman et al., 2017) - a similar method that adds a few techniques to make training more conservative and stable (e.g. clipping the objective function) - we utilize A2C given that it is simpler yet still effective for our problem.
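A sketch of the soft-label objective is shown below. It assumes the RM produces an unbounded score for each of the two candidates in a pair; this is a plausible reading of the description above rather than the authors' exact code.

    import torch
    import torch.nn.functional as F

    def soft_label_rm_loss(scores: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
        # scores:      RM scores for (summary 1, summary 2), shape (batch, 2)
        # soft_labels: AI preference distributions, e.g. [0.6, 0.4], shape (batch, 2)
        log_probs = F.log_softmax(scores, dim=-1)              # softmax turns scores into a distribution
        return -(soft_labels * log_probs).sum(dim=-1).mean()   # cross-entropy against soft targets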
3.3 Evaluation

We evaluate our results with three metrics - AI Labeler Alignment, Pairwise Accuracy, and Win Rate.

AI Labeler Alignment measures the accuracy of AI-labeled preferences with respect to human preferences. For a single example, it is calculated by converting a soft AI-labeled preference to a binary representation (e.g. preferences_i = [0.6, 0.4] → [1, 0]) and then assigning a 1 if the label agrees with the target human preference and 0 otherwise. It can be expressed as follows:

Align = ( Σ_{i=1}^{|D|} 1[ argmax_x pr^{AI}_{i,x} = argmax_x pr^{H}_{i,x} ] ) / |D|

where pr^{AI}_i and pr^{H}_i are binary representations of the AI and human preferences, respectively, x is an index over the two candidates, and D is a dataset.

Pairwise Accuracy measures how accurate a trained reward model is with respect to a held-out set of human preferences. Given a shared context and a pair of candidate responses, the Pairwise Accuracy is 1 if the RM scores the preferred candidate higher than the non-preferred candidate, according to the human label. Otherwise the value is 0. This quantity is averaged over multiple examples to measure the total accuracy of the RM.

Win Rate evaluates the end-to-end quality of two policies by measuring how often one policy is preferred by humans over another. Given an input and two generations, human annotators select which generation is preferred. The percentage of instances where policy A is preferred over policy B is referred to as the "Win Rate of A vs. B".
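The first two metrics are straightforward to compute; the sketch below shows AI Labeler Alignment and Pairwise Accuracy for illustration. It is a simplified rendering of the definitions above, with NumPy arrays assumed as inputs.

    import numpy as np

    def ai_labeler_alignment(ai_prefs: np.ndarray, human_prefs: np.ndarray) -> float:
        # ai_prefs:    soft AI preference distributions, shape (N, 2)
        # human_prefs: human preference labels as distributions or one-hot vectors, shape (N, 2)
        agree = ai_prefs.argmax(axis=1) == human_prefs.argmax(axis=1)
        return float(agree.mean())

    def pairwise_accuracy(rm_score_preferred: np.ndarray, rm_score_other: np.ndarray) -> float:
        # 1 if the RM scores the human-preferred candidate higher, averaged over examples.
        return float((rm_score_preferred > rm_score_other).mean())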
4 Experimental Details

4.1 Datasets

Following the work of Stiennon et al. (2020), we use the filtered Reddit TL;DR dataset curated by OpenAI. TL;DR contains ∼3 million posts from Reddit[2] across a variety of topics (also known as "subreddits"), alongside summaries of the posts written by the original authors. The data is additionally filtered by OpenAI to ensure high quality, which includes using a whitelist of subreddits that are understandable to the general population. Additionally, only posts where the summaries contain between 24 and 48 tokens are included. The filtered dataset contains 123,169 posts, where ∼5% is held out as a validation set. More details on the dataset can be found in the original paper.

Additionally, OpenAI curated a human preference dataset from the filtered TL;DR dataset. For a given post, two candidate summaries were generated from different policies, and labelers were asked to rate which summary they preferred. The total dataset comprises ∼92k pairwise comparisons.

[2] www.reddit.com

4.2 LLM Labeling

For evaluating the efficacy of AI labeling techniques (e.g. prompting, self-consistency), we select examples from the TL;DR preference dataset where human annotators preferred one summary over the other with higher confidence[3]. We evaluate AI Labeler Alignment on a random 15% subset of the training split of the dataset to enable faster experiment iteration, yielding 2,851 examples for evaluation. For reward model training, the full training split of the TL;DR preference dataset is labeled by the LLM and used for training, regardless of confidence scores.

[3] This follows the evaluation procedure in Stiennon et al. (2020).

We use PaLM 2 as our LLM for labeling preferences (Google et al., 2023). Unless otherwise specified, we use the Large model size with a maximum context length of 4096 tokens. For chain-of-thought generation, we set a maximum decoding length of 512 tokens and sample with temperature T = 0 (i.e. greedy decoding). For self-consistency experiments, we use temperature T = 1 with top-K sampling (Fan et al., 2018), where K = 40.

4.3 Model Training

We train a SFT model on OpenAI's filtered TL;DR dataset, using PaLM 2 Extra-Small (XS) as our initial checkpoint.

We then initialize our RMs from the SFT model and train them on OpenAI's TL;DR human preference dataset. For the results in Section 5.1, we generate AI-labeled preferences using PaLM 2 L with the "OpenAI + COT 0-shot" prompt (see Section 5.2) without self-consistency, and then train the RM on the full preference dataset.

For reinforcement learning, we train the policy with Advantage Actor Critic (A2C) as described in Appendix B. Both policy and value models are initialized from the SFT model. We roll out our policies using the filtered Reddit TL;DR dataset as our initial states.

For more training details, see Appendix C.

4.4 Human Evaluation

We collected 1,200 ratings from humans to evaluate the RLHF and RLAIF policies. For each rating task, the evaluator was presented with a post and 4 summaries generated from different policies (one from each of RLAIF, RLHF, SFT, and Human Reference) and asked to rank them in order of quality without ties. Posts were drawn from the held-out set of the TL;DR supervised fine-tuning dataset, which was not used in any other evaluation. Once these rankings were collected, it was possible to calculate win rates with respect to any two policies.

5 Results

5.1 RLAIF vs. RLHF

Our results show that RLAIF achieves similar performance to RLHF (see Figure 1). RLAIF is preferred by human evaluators over the baseline SFT policy 71% of the time. In comparison, RLHF is preferred over SFT 73% of the time. While RLHF slightly outperforms RLAIF, the difference is not statistically significant[4]. We also directly compare the win rate of RLAIF vs. RLHF and find that they are equally preferred - i.e. the win rate is 50%. To better understand how RLAIF compares to RLHF, we qualitatively compare summaries generated by both policies in Section 6.

[4] When conducting a two-sample t-test, p-value = 0.25, which is greater than 0.05. Therefore, we cannot reject the null hypothesis.

We also compare RLAIF and RLHF summaries vs. human-written reference summaries. RLAIF summaries are preferred over the reference summaries 79% of the time, and RLHF summaries are preferred over the reference summaries 80% of the time. The difference in win rates between RLAIF and RLHF vs. the reference summaries is also not statistically significant.

One confounding factor in our results is that our RLAIF and RLHF policies tend to generate longer summaries than the SFT policy, which can account for some of the quality improvements. Similar to Stiennon et al. (2020), we conduct post-hoc analysis that suggests that while both RLAIF and RLHF policies benefit from producing longer summaries, both still outperform the SFT policy by a similar margin after controlling for length. Full details are in Appendix D.

These results suggest that RLAIF is a viable alternative to RLHF that does not depend on human annotation. To understand how well these findings generalize to other NLP tasks, experiments on a broader range of tasks are required, which we leave to future work.

5.2 Prompting Techniques

We experiment with three types of prompting techniques - preamble specificity, chain-of-thought reasoning, and few-shot in-context learning - and report the results in Table 2.

Using the detailed OpenAI preamble improves alignment by +1.3% (77.4% "OpenAI 0-shot" vs. 76.1% "Base 0-shot"), and chain-of-thought reasoning improves alignment by +1.4% (77.5% "Base + COT 0-shot" vs. 76.1% "Base 0-shot").
Figure 4: Example summaries generated by SFT, RLHF and RLAIF policies for a Reddit post. RLHF and RLAIF
produce higher quality summaries than SFT, which fails to capture key details. Salient details are bolded.

Prompt                   AI Labeler Alignment
Base 0-shot              76.1%
Base 1-shot              76.0%
Base 2-shot              75.7%
Base + COT 0-shot        77.5%
OpenAI 0-shot            77.4%
OpenAI 1-shot            76.2%
OpenAI 2-shot            76.3%
OpenAI 8-shot            69.8%
OpenAI + COT 0-shot      78.0%
OpenAI + COT 1-shot      77.4%
OpenAI + COT 2-shot      76.8%

Table 2: We observe that prompting with the detailed OpenAI preamble and eliciting chain-of-thought reasoning gives the highest AI Labeler Alignment. In-context learning does not improve accuracy, and possibly even makes it worse.

Though the improvement from combining the two techniques does not match the sum of their individual gains, the techniques are still complementary, together yielding a +1.9% improvement.

We observe that few-shot in-context learning does not improve alignment, and potentially even degrades it. For "OpenAI + COT k-shot" prompts, we see accuracy monotonically decrease as k increases from 0 to 2. One hypothesis is that the LLM is able to generate more useful chain-of-thought rationales on its own than when it follows the rationales given in our 1-shot and 2-shot exemplars (see Table 9 for examples).

To understand whether adding more exemplars might yield improvements, we experimented with an 8-shot prompt and found that accuracy decreased by -7.6% (69.8% "OpenAI 8-shot" vs. 77.4% "OpenAI 0-shot"). We verified that all examples used in this experiment fit within our AI labeler's context length.

Overall, we observe that the optimal configuration employs a detailed preamble, chain-of-thought reasoning, and no in-context learning ("OpenAI + COT 0-shot"). This combination achieves an AI Labeler Alignment of 78.0%, which is +1.9% higher than using our most basic prompt ("Base 0-shot"). As a point of comparison, Stiennon et al. (2020) estimated that human inter-annotator agreement was 73-77% on the human preference dataset, suggesting that our LLM performs rather well. We use the "OpenAI + COT 0-shot" prompt for all other experiments.

5.3 Self-Consistency

Self-Consistency     AI Labeler Alignment
1 sample, T=0        78.0%
4 samples, T=1       72.6%
16 samples, T=1      72.8%

Table 3: Sampling several chain-of-thought rationales with T = 1 results in lower alignment with human preferences. Note: 1, 4, and 16 samples represent 2, 8, and 32 inferences given our position de-biasing technique (see Section 3.1.1).

We experiment with self-consistency using 4 and 16 samples with a decoding temperature of 1, as described in Section 3.1.3, and both settings show drops in alignment of greater than -5% versus not using self-consistency. Manually inspecting chain-of-thought rationales did not reveal any common patterns for why self-consistency might result in lower accuracy (see examples in Table 10).

One hypothesis for the degradation in accuracy is that using a temperature of 1 leads the model to generate lower quality chain-of-thought rationales compared to greedy decoding, ultimately leading to worse accuracy overall. Using a temperature between 0 and 1 may yield better results.
5.4 Size of LLM Labeler

Model Size     AI Labeler Alignment
PaLM 2 XS      62.7%
PaLM 2 S       73.8%
PaLM 2 L       78.0%

Table 4: AI Labeler Alignment increases as the size of the LLM labeler increases.

Large model sizes are not widely accessible and can be slow and expensive to run. We experiment with labeling preferences with different model sizes and observe a strong relationship between alignment and size. Alignment drops -4.2% when moving from PaLM 2 Large (L) down to PaLM 2 Small (S), and it drops another -11.1% when moving down to PaLM 2 XS. This trend is consistent with scaling laws observed in other work (Kaplan et al., 2020). One contributing factor to the decline in performance could be the increase in position bias in smaller LLMs (see Appendix A).

On the other end of this trend, these results also suggest that scaling up AI labeler size may produce even higher quality preference labels. Since the AI labeler is only used to generate preference examples once and is not queried during RL training, using an even larger AI labeler is not necessarily prohibitively expensive. Furthermore, Section 5.5 suggests that a small number of examples may be sufficient to train a powerful RM (e.g. on the order of O(1k)), further reducing the cost of using a larger labeler model.

5.5 Number of Preference Examples

To understand how RM accuracy changes with the number of training examples, we train a RM on varying amounts of AI-labeled preference examples and evaluate Pairwise Accuracy on a held-out set of human preferences. We obtain different amounts of training examples by randomly subsampling the full preference datasets. Results are displayed in Figure 5.

Figure 5: RM accuracy on a held-out set of human preferences increases rapidly as more preference pairs are used in training. After training on a few thousand examples, performance is close to training on the full dataset. The x-axis is in log-scale.

We observe that the performance of the AI preference RM quickly plateaus after training on a few thousand examples. The RM achieves ∼60% accuracy when training on only 128 examples and then reaches an accuracy close to that of training on the full dataset when training with only 5,000 examples (roughly 1/20 of the full dataset).

We also conduct a parallel set of experiments on a RM trained on human preferences. We find that the human and AI RMs follow similar scaling curves. One difference is that the human preference RM appears to continually improve as the number of training examples increases, though more training examples bring only small improvements to accuracy. This trend suggests that RMs trained on AI preferences may not benefit as much from scaling up the number of training examples as RMs trained on human preferences.

Given the limited improvement from scaling up the number of AI preference examples, more resources may be better spent on labeling with larger model sizes (see Section 5.4) rather than labeling more preference examples.

6 Qualitative Analysis

To better understand how RLAIF compares to RLHF, we manually inspected summaries generated by both policies. In many cases, the two policies produced similar summaries, which is reflected in their similar win rates (see Section 5.1). However, we identified two patterns where they frequently diverged.
One pattern we observed is that RLAIF appears less likely to hallucinate than RLHF. The hallucinations in RLHF are often plausible but are inconsistent with the original text. For instance, in Example #1 of Table 11, the RLHF summary states that the author is 20 years old, but this is not mentioned or implied by the original text.

Another pattern we observed is that RLAIF sometimes produces less coherent or grammatical summaries than RLHF. For instance, in Example #1 of Table 12, the RLAIF summary contains run-on sentences.

Overall, while we observe certain tendencies for each policy, both produce high-quality summaries that are relatively similar.

7 Related Work

LLMs (Brown et al., 2020; Thoppilan et al., 2022; Chowdhery et al., 2022; Google et al., 2023; OpenAI, 2023) have shown impressive performance over a wide range of NLP tasks. For several of these tasks, RL has emerged as an effective optimization technique. While initial applications of RL on tasks such as translation (Wu et al., 2016, 2018) and summarization (Gao et al., 2019; Wu and Hu, 2018) used automatic evaluation metrics as rewards, such simplified formulations of rewards did not fully align with human notions of quality.

Reinforcement learning from human feedback (Christiano et al., 2017) has been used as a technique to directly align LLMs with human preferences (Ziegler et al., 2019) by training a reward model on pairwise comparisons of natural language responses. It has been successfully applied to summarization (Stiennon et al., 2020), generalized instruction following (Ouyang et al., 2022; Lai et al., 2023), dialogue (Gilardi et al., 2023; Manyika, 2023; Glaese et al., 2022; Bai et al., 2022a), and question answering (Nakano et al., 2021).

LLMs have also been extensively used for data generation (Wang et al., 2021; Meng et al., 2023), augmentation (Feng et al., 2021), and in self-training setups (Wang et al., 2022; Madaan et al., 2023). Bai et al. (2022b) introduced the idea of RL from AI feedback (RLAIF), which used LLM-labeled preferences in conjunction with human-labeled preferences to jointly optimize for the two conflicting objectives of helpfulness and harmlessness. Recent works have also explored related techniques for generating rewards from LLMs (Roit et al., 2023; Kwon et al., 2022; Yang et al., 2023). These works demonstrate that LLMs can generate useful signals for RL fine-tuning, which inspired this work's investigation into whether LLMs can serve as a viable alternative to humans in collecting preference labels for reinforcement learning.

8 Conclusion

In this work, we show that RLAIF can produce comparable improvements to RLHF without depending on human annotators. Our experiments show that RLAIF greatly improves upon a SFT baseline, and the margin of improvement is on par with that of RLHF. In head-to-head comparisons, RLAIF and RLHF are preferred at similar rates by humans. We also study various AI labeling techniques and conduct scaling studies to understand the optimal settings for generating aligned preferences.

While this work highlights the potential of RLAIF, we note some limitations of these findings. First, this work only explores the task of summarization, leaving an open question about generalizability to other tasks. Second, we did not estimate whether LLM inference is advantageous over human labeling in terms of monetary cost. Additionally, there remain many interesting open questions, such as whether RLHF combined with RLAIF can outperform a single approach alone, how well using a LLM to directly assign rewards performs, whether improving AI Labeler Alignment translates to improved final policies, and whether using a LLM labeler the same size as the policy model can further improve the policy (i.e. whether a model can "self-improve"). We leave these questions for future work.

We hope that this paper motivates further research in the area of RLAIF.

Acknowledgements

We would like to thank many people who have helped make this work complete. We thank Chen Zhu for optimizing our LLM inference setup, Le Hou for suggesting prompt improvements and experimenting with self-consistency, Johan Ferret for tips on how to tune better RL policies, and Léonard Hussenot for bringing the problem of position bias in LLMs to our attention.

We thank everyone who thoroughly reviewed our work and provided valuable feedback: Hakim Sidahmed, Michal Valko, Nevan Wichers, Sian Gooding, Sushant Prakash, and Yuan Cao.

Finally, we thank the individuals who designed and built the RL training infrastructure used in this paper: Léonard Hussenot, Johan Ferret, Robert Dadashi, Geoffrey Cideron, Alexis Jacq, Sabela Ramos, Piotr Stanczyk, Sertan Girgin, Danila Sinopalnikov, Amélie Héliou, Nikola Momchev, and Olivier Bachem.
References

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional AI: Harmlessness from AI feedback.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. 2023. Is GPT-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11173–11195, Toronto, Canada. Association for Computational Linguistics.

Tom Everitt and Marcus Hutter. 2016. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9, pages 12–22. Springer.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.

Yang Gao, Christian M Meyer, Mohsen Mesgar, and Iryna Gurevych. 2019. Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894.

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.

Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.

Google, Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, et al. 2023. PaLM 2 technical report.

Sham M Kakade. 2001. A natural policy gradient. Advances in Neural Information Processing Systems, 14.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. 2022. Reward design with language models. In The Eleventh International Conference on Learning Representations.

Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv preprint arXiv:2307.16039.

Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. 2023. Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.

James Manyika. 2023. An overview of Bard: an early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf. Accessed: 2023-08-23.

Yu Meng, Martin Michalski, Jiaxin Huang, Yu Zhang, Tarek Abdelzaher, and Jiawei Han. 2023. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pages 24457–24477. PMLR.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

OpenAI. 2023. GPT-4 technical report.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Pouya Pezeshkpour and Estevam Hruschka. 2023. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483.

Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, et al. 2023. Factually consistent summarization via reinforcement learning with textual entailment feedback. arXiv preprint arXiv:2306.00186.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
The Eleventh International Conference on Learning candidates are shown. For each example in our AI
Representations. labeling evaluation set, we query the LLM prefer-
Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao. ences for the pair of candidates, swap the order in
2021. Towards zero-label language learning. arXiv which candidates are presented, and then query the
preprint arXiv:2109.09193. LLM preferences again.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten We consider an LLM to be more biased if it
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, prefers the same position on both the original and
et al. 2022. Chain-of-thought prompting elicits rea- reversed inferences. For example, let candidates
soning in large language models. Advances in Neural A and B be in positions 1 and 2 for the first infer-
Information Processing Systems, 35:24824–24837.
ence and then in positions 2 and 1 for the second,
Ronald J Williams. 1992. Simple statistical gradient- respectively. If the LLM prefers the same posi-
following algorithms for connectionist reinforcement tion on both inferences, we consider the LLM to
learning. Machine learning, 8(3):229–256.
be position-biased. We measure position bias by
Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie- computing "% Same Position Preferred" - the per-
Yan Liu. 2018. A study of reinforcement learning centage of inference pairs where this occurs, and a
for neural machine translation. In Proceedings of the higher metric value indicates a more biased LLM.
2018 Conference on Empirical Methods in Natural
Language Processing, pages 3612–3621. We find that PaLM 2 L, S, and XS prefer the
same position 18%, 21%, and 56% of the time, re-
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, spectively (see Table 5), suggesting that position
Mohammad Norouzi, Wolfgang Macherey, Maxim
Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. bias is inversely proportional to model size. One hy-
2016. Google’s neural machine translation system: pothesis is that larger models are more capable and
Bridging the gap between human and machine trans- therefore more faithfully judge preferences based
lation. arXiv preprint arXiv:1609.08144. on the content of the candidates rather than their
Yuxiang Wu and Baotian Hu. 2018. Learning to extract positions, which are supposed to be immaterial.
coherent summary via deep reinforcement learning. We also observe that for PaLM 2 L, of the 18%
In Proceedings of the AAAI Conference on Artificial of cases where it prefers the same position on both
Intelligence, page 5602.
inferences, 94% of the time it prefers the first can-
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, didate shown. On the other hand, PaLM 2 S and
and Yuandong Tian. 2023. Rlcd: Reinforcement XS show affinity for the second candidate shown,
learning from contrast distillation for language model
preferring it 91% and 99% of the time, respectively,
alignment.
when the same position is preferred on both in-
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B ferences. These biases are statistically significant
Brown, Alec Radford, Dario Amodei, Paul Chris- under a two-sided binomial test at α = 0.05.
tiano, and Geoffrey Irving. 2019. Fine-tuning lan-
guage models from human preferences. arXiv
preprint arXiv:1909.08593. B A2C for Language Models
Consider a generic MDP (X , A, R, P, γ). At each
A Position Bias in LLM Labelers step t, given the current state Xt ∈ X and the
next action At ∈ A, the model receives a reward
Model Size % Same Position Preferred Rt = R(Xt , At ) and transitions to the next state
PaLM 2 L 18% Xt+1 = (Xt , At ).
PaLM 2 S 21% In the context of language models, Xt is the con-
PaLM 2 XS 56% catenation of the input text and all text the policy
has generated up to time t. Action At is the token
Table 5: Position bias is more prevalent in smaller model
sizes, as indicated by "% Same Position Preferred", decoded at time t by the stochastic policy πθ (·|Xt )
which measures the percentage of examples where the from the considered vocabulary, where θ represents
LLM prefers the same position even after swapping the the policy parameters. Finally, the reward Rt is
order of candidates. Analysis is conducted using the given by the RM. The RM is only evaluated when
"OpenAI + COT 0-shot" prompt. the language model response has been fully gener-
ated, and therefore all rewards before the last token
Our analysis suggests that the LLMs used for are 0 while the reward corresponding to the final
preference labeling are biased by the order in which token is RTlast .
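The "% Same Position Preferred" metric can be computed as below. This is an illustrative sketch of the definition above; the inputs are assumed to be the argmax preferences from the original and order-reversed inferences.

    import numpy as np

    def pct_same_position_preferred(pref_original: np.ndarray, pref_reversed: np.ndarray) -> float:
        # pref_original[i]: index (0 or 1) of the position preferred in the original order
        # pref_reversed[i]: index (0 or 1) of the position preferred after swapping the candidates
        # The LLM is counted as position-biased on example i if it picks the same position both times.
        return float((pref_original == pref_reversed).mean())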
B A2C for Language Models

Consider a generic MDP (X, A, R, P, γ). At each step t, given the current state X_t ∈ X and the next action A_t ∈ A, the model receives a reward R_t = R(X_t, A_t) and transitions to the next state X_{t+1} = (X_t, A_t).

In the context of language models, X_t is the concatenation of the input text and all text the policy has generated up to time t. The action A_t is the token decoded at time t by the stochastic policy π_θ(·|X_t) from the considered vocabulary, where θ represents the policy parameters. Finally, the reward R_t is given by the RM. The RM is only evaluated when the language model response has been fully generated; therefore all rewards before the last token are 0, while the reward corresponding to the final token is R_{T_last}.

The cumulative sum of rewards received when following the policy π from a state-action pair (X_t = x, A_t = a) is called the return. Generally, it is defined as Z^π_{x,a} = Σ_{s=t}^{T_last} γ^{s−t} R_s. However, since only the terminal reward is non-zero and we use γ = 1, the return can be simplified to Z^π_{x,a} = R_{T_last}.

Given a trajectory (X_t, A_t, R_t)_{t=0}^{T_last} generated under π_θ, the Advantage Actor Critic estimator is defined as follows:

L_A2C = Σ_{t≥0} log π_θ(A_t | X_t) \overline{( R_{T_last} − V_ψ^π(X_t) )}

where the bar notation denotes that no gradient is passed through the advantage term during the policy training phase.

The baseline value function V_ψ^π(x) estimates the return-to-go R_{T_last} when following the policy π_θ and is parametrized by ψ (Williams, 1992; Sutton et al., 1999). It is trained with the following loss:

L_baseline = Σ_t ( R_{T_last} − V_ψ^π(X_t) )²
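A compact PyTorch-style sketch of these two losses is given below. It assumes per-token log probabilities and value estimates for a single completed response and is meant only to mirror the equations above, not to reproduce the authors' training code.

    import torch

    def a2c_losses(logprobs: torch.Tensor, values: torch.Tensor, terminal_reward: torch.Tensor):
        # logprobs:        log pi_theta(A_t | X_t) for each generated token, shape (T,)
        # values:          V_psi(X_t) for each step, shape (T,)
        # terminal_reward: scalar reward R_{T_last} from the RM for the full response
        advantage = (terminal_reward - values).detach()   # bar notation: stop gradient through the advantage
        policy_loss = -(logprobs * advantage).sum()       # negated so that minimizing maximizes L_A2C
        value_loss = ((terminal_reward - values) ** 2).sum()
        return policy_loss, value_loss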
C Model Training Details

We train a SFT model with a batch size of 128 for a single epoch. We use the Adafactor (Shazeer and Stern, 2018) optimizer with a learning rate of 10^-5, and we set a maximum input and output length of 1024 and 128 tokens, respectively.

Each RM is trained for 3 epochs, which is when the loss and accuracy curves plateau. We use the Adafactor optimizer with a learning rate of 10^-5 and a batch size of 128.

For reinforcement learning, we sample from our language model policies with a temperature of T = 0.9 to encourage exploration. We train with a batch size of 128 and a learning rate of 10^-5 for 8 epochs, resulting in ∼1 million episodes. We set β = 0.05 for the KL divergence loss.

D Controlling for Summary Length

Our RLAIF and RLHF policies tend to generate longer summaries than the baseline SFT policy. For the summaries sent to human evaluation, the mean character lengths of summaries produced by the RLAIF, RLHF, and SFT policies were 164, 161, and 132, respectively. We conduct post-hoc analysis to estimate the win rates of RLAIF and RLHF vs. SFT after controlling for length.

We take an approach similar to Stiennon et al. (2020). For each of our RL policies, we train a logistic regression model where the input is the ratio of the RL summary length to the SFT summary length (in characters) and the target is a binary label indicating whether RL was preferred to SFT. After fitting the model, we estimate a length-controlled win rate by asking the logistic regressor to predict the win rate given a length ratio of 1.0, which represents the scenario where both the RL and SFT summaries are of equal length.

After controlling for length, our estimated win rates for RLAIF and RLHF vs. SFT are 59% and 61%, respectively (see Figure 6). Both RL policies continue to outperform the SFT policy by a similar margin, supporting our initial conclusion that RLAIF is comparable to RLHF.

Figure 6: After controlling for summary length, RLAIF and RLHF policies both still outperform the baseline SFT policy and achieve similar win rates.

We note that this post-hoc method of controlling for length is imperfect, as it assumes the logistic regression model can accurately learn the relationship between summary length and human preference. A more principled approach would be to have all policies generate summaries of similar length (e.g. by encouraging policies to generate summaries of a fixed length during optimization).
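A sketch of this length-controlled estimate using scikit-learn is shown below. The variable names are illustrative, and the data arrays are assumed to hold, per human comparison, the RL/SFT length ratio and whether the RL summary was preferred.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def length_controlled_win_rate(length_ratios: np.ndarray, rl_preferred: np.ndarray) -> float:
        # length_ratios: len(RL summary) / len(SFT summary) in characters, shape (N,)
        # rl_preferred:  1 if the human preferred the RL summary over SFT, else 0, shape (N,)
        model = LogisticRegression()
        model.fit(length_ratios.reshape(-1, 1), rl_preferred)
        # Predicted win probability at a length ratio of 1.0 (equal-length summaries).
        return float(model.predict_proba(np.array([[1.0]]))[0, 1])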
"Base" preamble You are an expert summary rater. Given a piece of text and
two of its possible summaries, output 1 or 2 to indicate
which summary is better.

"OpenAI" preamble A good summary is a shorter piece of text that has the
essence of the original. It tries to accomplish the same
purpose and conveys the key information from the original
post. Below we define four evaluation axes for summary
quality: coherence, accuracy, coverage, and overall quality.

Coherence: This axis answers the question “how coherent is


the summary on its own?” A summary is coherent if it’s easy
to understand when read on its own and free of English errors.
A summary is not coherent if it’s difficult to understand
what the summary is trying to say. Generally, it’s more
important that the summary is understandable than it being
free of grammar errors.

Accuracy: This axis answers the question “does the factual


information in the summary accurately match the post?” A
summary is accurate if it doesn’t say things that aren’t in
the article, it doesn’t mix up people, and generally is not
misleading.

Coverage: This axis answers the question “how well does


the summary cover the important information in the post?” A
summary has good coverage if it mentions the main information
from the post that’s important to understand the situation
described in the post. A summary has poor coverage if
someone reading only the summary would be missing several
important pieces of information about the situation in the
post. A summary with good coverage should also match the
purpose of the original post (e.g. to ask for advice).

Overall quality: This axis answers the question “how good


is the summary overall at representing the post?” This can
encompass all of the above axes of quality, as well as others
you feel are important. If it’s hard to find ways to make
the summary better, the overall quality is good. If there
are lots of different ways the summary can be made better,
the overall quality is bad.

You are an expert summary rater. Given a piece of text and


two of its possible summaries, output 1 or 2 to indicate
which summary best adheres to coherence, accuracy, coverage,
and overall quality as defined above.

Table 6: The "Base" and "OpenAI" preambles given to the LLM labeler to obtain preference labels.

Preamble A good summary is a shorter piece of text that has the


essence of the original. ... Given a piece of text and
two of its possible summaries, explain which summary best
adheres to coherence, accuracy, coverage, and overall quality
as defined above.

Sample to Annotate Text - {text}


Summary 1 - {summary1}
Summary 2 - {summary2}

Ending Consider the coherence, accuracy, coverage, and overall


quality of each summary and explain which one is better.

Rationale:

Table 7: The template used for the "OpenAI + COT 0-shot" prompt, with some text removed for brevity. For COT
prompts, we first decode a response from the LLM and then concatenate it with the original prompt and the ending
"Preferred Summary=" before following the scoring procedure in Section 3.1 to obtain a preference distribution.
Preamble A good summary is a shorter piece of text that has the
essence of the original. ... Given a piece of text and
two of its possible summaries, explain which summary best
adheres to coherence, accuracy, coverage, and overall quality
as defined above.

1-shot Exemplar »»»» Example »»»»

Text - We were best friends over 4 years ...


Summary 1 - Broke up with best friend, should I wish her a
happy birthday... And what do you think of no contact?
Summary 2 - should I wish my ex happy birthday, I broke no
contact, I’m trying to be more patient, I’m too needy, and I
don’t want her to think I’ll keep being that guy.

Thoughts on Summary 1 -
Coherence - 7. Rationale: The summary is generally
understandable, though it could be written with better
grammar.
Accuracy - 9. Rationale: The summary doesn’t say things
that aren’t in the original text, and isn’t misleading.
Coverage - 6. Rationale: The summary covers most of the
important information in the post and conveys the gist of
the original text. However, it places more emphasis on "no
contact" and could have mentioned the smothering/neediness to
be more complete.
Overall Quality - 7. Rationale: The summary represents
the post fairly well with only minor areas where it could be
improved.

Thoughts on Summary 2 -
Coherence - 3. Rationale: The summary is long-winded and
has several grammatical errors.
Accuracy - 4. Rationale: The summary mentions that the
author broke no contact, but this is incorrect. Otherwise,
it is accurate.
Coverage - 8. Rationale: The summary covers the key points
in the original text.
Overall Quality - 4. Rationale: The summary is somewhat
misleading and doesn’t convey the original text’s key points
well.

Preferred Summary=1

»»»» Follow the instructions and the example(s) above »»»»

Sample to Annotate Text - {text}
Summary 1 - {summary1}
Summary 2 - {summary2}

Ending Thoughts on Summary 1 -

Table 8: The template used for the "OpenAI + COT 1-shot" prompt, with some text removed for brevity.
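Both chain-of-thought templates above are used in the two-step manner described in the Table 7 caption: first decode a rationale, then append the ending "Preferred Summary=" and score. The sketch below illustrates that procedure under the same assumptions as the previous snippet, with llm_generate as a hypothetical stand-in for the labeler LLM's sampling API and preference_distribution reused from the earlier sketch.

# A sketch of the two-step chain-of-thought labeling described in the Table 7
# caption. `llm_generate` is a hypothetical stand-in for the labeler LLM's
# sampling API; `preference_distribution` is the helper from the sketch above.
from typing import Callable, Dict

def cot_preference(
    cot_prompt: str,
    llm_generate: Callable[[str], str],
    llm_log_prob: Callable[[str, str], float],
) -> Dict[str, float]:
    # Step 1: decode a free-form rationale. The prompt already ends with
    # "Rationale:" (0-shot) or "Thoughts on Summary 1 -" (1-shot).
    rationale = llm_generate(cot_prompt)
    # Step 2: concatenate the original prompt, the rationale, and the ending
    # "Preferred Summary=", then score it as in the non-COT case.
    scoring_prompt = cot_prompt + rationale + "\nPreferred Summary="
    return preference_distribution(scoring_prompt, llm_log_prob)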
Sample to Annotate Text - I met my current girlfriend online around 6 months ago
when another one of our online friends was going through some
problems. ...

However, I feel incredibly insecure as she has yet to begin
college. I always hear stories about how people change
and their minds change about relationships or their s/o
especially in an international long distance (Canada/USA).
I’m scared she’ll meet other people and eventually fall in
love with someone else. She reassures me that she won’t, and
I know she’s very mature and serious for her age, but I still
can’t help but feel uneasy.

Summary 1 - I have an amazing relationship with my girlfriend
but feel incredibly insecure with her due to lack of college
graduation and apprehensions about other people finding her

Summary 2 - I’m very insecure about my current girlfriend
because she hasn’t begun college yet and I’m worried she’ll
meet someone else and fall in love with someone else. Am I
being ridiculous?

COT 0-shot Rationale Rationale: Summary 2 is better.

Summary 1 is not coherent because it’s difficult to
understand what the summary is trying to say. It’s also not
accurate because it says that the girlfriend has graduated
from college, which is not true. Summary 2 is coherent and
accurate. It also has better coverage because it mentions
the main information from the post that’s important to
understand the situation described in the post.

COT 1-shot Rationale Thoughts on Summary 1 -
Coherence - 7. Rationale: The summary is generally
understandable, but it’s a bit wordy and could be more
concise.
Accuracy - 8. Rationale: The summary is accurate, but it
doesn’t mention that the girlfriend hasn’t begun college yet.
Coverage - 7. Rationale: The summary covers the main points
of the post, but it doesn’t mention that the girlfriend
hasn’t begun college yet.
Overall Quality - 7. Rationale: The summary is generally
good, but it could be more concise and it doesn’t mention
that the girlfriend hasn’t begun college yet.

Thoughts on Summary 2 -
Coherence - 9. Rationale: The summary is concise and easy
to understand.
Accuracy - 9. Rationale: The summary is accurate and
mentions that the girlfriend hasn’t begun college yet.
Coverage - 9. Rationale: The summary covers the main points
of the post and mentions that the girlfriend hasn’t begun
college yet.
Overall Quality - 9. Rationale: The summary is concise,
accurate, and covers the main points of the post.

Table 9: An example of the different chain-of-thought rationales produced by the 0-shot ("OpenAI + COT 0-shot")
vs. 1-shot ("OpenAI + COT 1-shot") prompts.
Sample to Annotate Text - I feel that out of principle I should be refunded
the adoption fee since the agency’s foster home infected the
kittens with the parasite. Both cats were born in the foster
home and there are 20 other cats. Do I have any legal right
to ask for the fee back? Or help with the cost of treating?
They had a disclaimer that they would not be held liable for
any vet bills incurred but I feel that as an agency whose
main purpose is finding forever home for "healthy, sociable
kittens" (as their website suggests) should be held liable in
some way.

Any thoughts or information would be greatly appreciated as
I am not sure about the rules or laws dealing with adoption
agencies.

Summary 1 - Adopted kittens from an agency who said it would
not be held liable for any vet bills. Two days later they
are diagnosed with Giardia. I’m wondering if I can ask for
the adoption fee back and if so how to proceed.

Summary 2 - Adopted kittens have parasite. Want to know if
I can get reimbursed from the agency.

Rationale @ T = 0 Rationale: Summary 1 is better. It is coherent, accurate,
and has good coverage. It also has good overall quality.
Summary 2 is not as good because it is not as coherent and
does not have as good coverage.

Rationale @ T = 1 Rationale: Summary 2 is more coherent and accurate. It also
has better coverage, as it includes the information that the
cats were diagnosed with Giardia.

Table 10: An example comparing chain-of-thought rationales produced at different temperatures for self-consistency
experiments.
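For the self-consistency experiments, multiple rationales can be sampled at a non-zero temperature and their preference distributions aggregated. The sketch below assumes simple averaging as the aggregation rule and a temperature-taking llm_generate signature; both are illustrative assumptions rather than details taken from the paper, and the helpers from the earlier sketches are reused.

# A sketch of self-consistency over chain-of-thought rationales, under the
# same assumptions as the earlier snippets. Several rationales are sampled at
# temperature > 0 and their preference distributions are averaged (the
# averaging rule is an assumption, not a detail from the paper).
from typing import Callable, Dict

def self_consistency_preference(
    cot_prompt: str,
    llm_generate: Callable[[str, float], str],  # (prompt, temperature) -> text
    llm_log_prob: Callable[[str, str], float],
    num_samples: int = 4,
    temperature: float = 1.0,
) -> Dict[str, float]:
    totals = {"1": 0.0, "2": 0.0}
    for _ in range(num_samples):
        rationale = llm_generate(cot_prompt, temperature)
        scoring_prompt = cot_prompt + rationale + "\nPreferred Summary="
        dist = preference_distribution(scoring_prompt, llm_log_prob)
        for label, prob in dist.items():
            totals[label] += prob
    return {label: total / num_samples for label, total in totals.items()}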

Example #1 RLAIF summary: I have anxiety and I think I have a heart
problem. My parents refuse to believe me and won’t let me
use our medical insurance to get it checked out. What should
I do?

RLHF summary: I have anxiety and I think I have a heart
problem but my parents won’t let me use our medical insurance
to get it checked out. What should I do? I’m 20 years old.
..please help

Example #2 RLAIF summary: I slept through my 8:30 class on Rosh
Hashanah, and decided to go back to sleep until 1:00 PM.

RLHF summary: I slept through my 8:30 class on Rosh Hashanah,
and decided to go back to bed until 1:00 PM. I’m a lazy f***.

Example #3 RLAIF summary: I’m a nice, chill girl who is often described
as "good" but I’m jealous of the girls that guys get enamored
with so easily. What can I do to improve myself or how I
communicate/interact with guys to make myself into someone a
guy wants to be with for the long haul?

RLHF summary: I’m a nice, chill girl who is often described
as "boring". I’m jealous of the girls that guys get enamored
with so easily. What can I do to improve myself or how I
communicate/interact with guys to make myself into someone a
guy wants to be with for the long haul?

Table 11: We observe that the RLHF policy tends to hallucinate more frequently than the RLAIF policy.
Hallucinations are highlighted in red.
Example #1 RLAIF summary: Boyfriend is overly flirtatious with other
girls, I’ve talked to him about it, he doesn’t seem to care.
It’s causing trust issues. Am I overreacting? What else can
I do?

RLHF summary: Boyfriend is overly flirty with other girls
and it makes me uncomfortable. I’ve talked to him about it
multiple times, but he doesn’t seem to care. What else can I
do? Am I overreacting?

Example #2 RLAIF summary: Asked a girl to prom, things were going great
until I asked her. Now our conversations are awkward and I’m
not sure if I should ask her out. Should I just give up?

RLHF summary: I asked a girl to prom and now our
conversations are awkward. I like her and I’m wondering if
I should ask her out or just give up. I’m not sure

Example #3 RLAIF summary: Broke up with girl 1, she started dating my
best friend. Dated girl 2, she became unstable and we broke
up. I still feel attached to both. Why? How do I get over
this? Any advice?

RLHF summary: I broke up with a girl I was with for years,
and she started dating my best friend. I started dating a
new girl, and she became depressed and distant. I still feel
attached to both. Why?

Table 12: Another pattern identified through manual inspection is that summaries from the RLAIF
policy tend to be less coherent and grammatical than summaries from the RLHF policy. Less coherent
phrases are highlighted in red.
