RLAIF: Scaling Reinforcement Learning From Human Feedback With AI Feedback
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard,
Colton Bishop, Victor Carbune, Abhinav Rastogi
Google Research
{harrisonlee,samratph,hassan}@google.com
Abstract
1 Introduction
technique called "Reinforcement Learning from
Reinforcement Learning from Human Feedback AI Feedback" (RLAIF)1 . While they showed that
(RLHF) is an effective technique for aligning lan- utilizing a hybrid of human and AI preferences
guage models to human preferences (Stiennon in conjunction with the "Constitutional AI" self-
et al., 2020; Ouyang et al., 2022) and is cited as revision technique outperforms a supervised fine-
one of the key drivers of success in modern conver- tuned baseline, their work did not directly compare
sational language models like ChatGPT and Bard the efficacy of human vs. AI feedback, leaving
(Liu et al., 2023; Manyika, 2023). By training the question unanswered whether RLAIF can be a
with reinforcement learning (RL), language mod- suitable alternative to RLHF.
els can be optimized on complex, sequence-level In this work, we directly compare RLAIF against
objectives that are not easily differentiable with RLHF on the task of summarization. Given a text
traditional supervised fine-tuning. and two candidate responses, we assign a prefer-
The need for high-quality human labels is an ence label using an off-the-shelf LLM. We then
obstacle for scaling up RLHF, and one natural train a reward model (RM) on the LLM prefer-
question is whether artificially generated labels can ences with a contrastive loss. Finally, we fine-tune
achieve comparable results. Several works have a policy model with reinforcement learning, using
shown that large language models (LLMs) exhibit 1
We use "RLAIF" to denote training a reward model on AI-
a high degree of alignment with human judgment - labeled preferences followed by conducting RL fine-tuning.
even outperforming humans on some tasks (Gilardi This is distinct from "Constitutional AI", which improves
et al., 2023; Ding et al., 2023). Bai et al. (2022b) upon a supervised learning model through iteratively asking an
LLM to generate better responses according to a constitution.
was the first to explore using AI preferences to Both were introduced in Bai et al. (2022b) and are sometimes
train a reward model used for RL fine-tuning - a confused for one another.
Figure 2: A diagram depicting RLAIF (top) vs. RLHF (bottom)
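To make the pipeline described above and depicted in Figure 2 concrete, the following is an illustrative Python sketch of the three stages. The objects and method names (labeler_llm, reward_model, rl_trainer, and their methods) are placeholders of ours, not the paper's implementation.

```python
# Illustrative sketch of the RLAIF pipeline; labeler_llm, reward_model,
# policy, and rl_trainer are hypothetical objects standing in for real components.

def rlaif(unlabeled_triples, labeler_llm, reward_model, policy, rl_trainer):
    # 1) AI preference labeling: an off-the-shelf LLM scores each
    #    (text, summary1, summary2) triple, yielding a soft preference label.
    preferences = [
        labeler_llm.preference(text, summary1, summary2)
        for text, summary1, summary2 in unlabeled_triples
    ]

    # 2) Reward modeling: train the RM on the AI preference labels
    #    with a pairwise (contrastive) loss.
    reward_model.fit(unlabeled_triples, preferences)

    # 3) RL fine-tuning: optimize the policy against the learned reward model.
    rl_trainer.train(policy, reward_fn=reward_model.score)
    return policy
```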
• We demonstrate that RLAIF achieves comparable performance to RLHF on the task of summarization.

L_r(φ) = -E_{(x, y_w, y_l) ∼ D} [ log σ( r_φ(x, y_w) - r_φ(x, y_l) ) ]

where σ is the sigmoid function.
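This is the standard pairwise (Bradley-Terry style) reward-modeling objective. A minimal PyTorch sketch, assuming the reward model already produces scalar scores for the preferred and non-preferred responses (the function and variable names are ours, not the paper's code):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -E[log sigmoid(r_phi(x, y_w) - r_phi(x, y_l))].

    r_preferred and r_rejected are batches of scalar RM scores for the
    preferred (y_w) and non-preferred (y_l) responses, respectively.
    """
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```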
Preamble A good summary is a shorter piece of text that has the
essence of the original. ... Given a piece of text and two
of its possible summaries, output 1 or 2 to indicate which
summary best adheres to coherence, accuracy, coverage, and
overall quality as defined above.
Preferred Summary=1
Table 1: An example of a prompt fed to an off-the-shelf LLM to generate AI preference labels. "{text}", "{summary1}", and "{summary2}" are populated with unlabeled examples, and a preference distribution is obtained by computing the softmax of the log probabilities of generating the tokens "1" vs. "2".
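The scoring step in the caption above can be sketched in a few lines, assuming access to the labeler LLM's log probabilities for generating the tokens "1" and "2" at the position following the prompt. The function name is ours.

```python
import math

def preference_distribution(logprob_1: float, logprob_2: float) -> tuple[float, float]:
    """Softmax over the log probabilities of the tokens "1" and "2",
    giving (P(summary 1 preferred), P(summary 2 preferred))."""
    m = max(logprob_1, logprob_2)  # subtract max for numerical stability
    e1, e2 = math.exp(logprob_1 - m), math.exp(logprob_2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)
```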
Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2023. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv preprint arXiv:2307.16039.

Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. 2023. Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.

James Manyika. 2023. An overview of Bard: An early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf. Accessed: 2023-08-23.

Yu Meng, Martin Michalski, Jiaxin Huang, Yu Zhang, Tarek Abdelzaher, and Jiawei Han. 2023. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pages 24457–24477. PMLR.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.

Zirui Wang, Adams Wei Yu, Orhan Firat, and Yuan Cao. 2021. Towards zero-label language learning. arXiv preprint arXiv:2109.09193.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.

Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, page 5602.

Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, and Yuandong Tian. 2023. RLCD: Reinforcement learning from contrast distillation for language model alignment.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

Figure 6: After controlling for summary length, RLAIF and RLHF policies both still outperform the baseline SFT policy and achieve similar win rate.

A Position Bias in LLM Labelers

Model Size    % Same Position Preferred
PaLM 2 L      18%
PaLM 2 S      21%
PaLM 2 XS     56%

Table 5: Position bias is more prevalent in smaller model sizes, as indicated by "% Same Position Preferred", which measures the percentage of examples where the LLM prefers the same position even after swapping the order of candidates. Analysis is conducted using the "OpenAI + COT 0-shot" prompt.

Our analysis suggests that the LLMs used for preference labeling are biased by the order in which candidates are shown. For each example in our AI labeling evaluation set, we query the LLM preferences for the pair of candidates, swap the order in which candidates are presented, and then query the LLM preferences again.

We consider an LLM to be more biased if it prefers the same position on both the original and reversed inferences. For example, let candidates A and B be in positions 1 and 2 for the first inference and then in positions 2 and 1 for the second, respectively. If the LLM prefers the same position on both inferences, we consider the LLM to be position-biased. We measure position bias by computing "% Same Position Preferred" - the percentage of inference pairs where this occurs, and a higher metric value indicates a more biased LLM.

We find that PaLM 2 L, S, and XS prefer the same position 18%, 21%, and 56% of the time, respectively (see Table 5), suggesting that position bias is inversely proportional to model size. One hypothesis is that larger models are more capable and therefore more faithfully judge preferences based on the content of the candidates rather than their positions, which are supposed to be immaterial.

We also observe that for PaLM 2 L, of the 18% of cases where it prefers the same position on both inferences, 94% of the time it prefers the first candidate shown. On the other hand, PaLM 2 S and XS show affinity for the second candidate shown, preferring it 91% and 99% of the time, respectively, when the same position is preferred on both inferences. These biases are statistically significant under a two-sided binomial test at α = 0.05.

B A2C for Language Models

Consider a generic MDP (X, A, R, P, γ). At each step t, given the current state X_t ∈ X and the next action A_t ∈ A, the model receives a reward R_t = R(X_t, A_t) and transitions to the next state X_{t+1} = (X_t, A_t).

In the context of language models, X_t is the concatenation of the input text and all text the policy has generated up to time t. Action A_t is the token decoded at time t by the stochastic policy π_θ(·|X_t) from the considered vocabulary, where θ represents the policy parameters. Finally, the reward R_t is given by the RM. The RM is only evaluated when the language model response has been fully generated, and therefore all rewards before the last token are 0 while the reward corresponding to the final token is R_{T_last}.

The cumulative sum of rewards received when following the policy π from a state-action pair (X_t = x, A_t = a) is called the return. Generally, it is defined as

Z^π_{x,a} = Σ_{s=t}^{T_last} γ^{s-t} R_s.

However, since only the terminal reward is non-zero and we use γ = 1, the return can be simplified to Z^π_{x,a} = R_{T_last}.

Given a trajectory (X_t, A_t, R_t)_{t=0}^{T_last} generated under π_θ, the Advantage Actor Critic estimator is defined as follows:

L^{A2C} = Σ_{t≥0} log π_θ(A_t | X_t) ( R_{T_last} - V^π_ψ(X_t) )
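As a rough illustration of how this estimator translates into code, the sketch below computes the policy term for a single sampled response, assuming the per-token log probabilities, value-head baselines, and terminal RM reward have already been gathered. The names are ours, and the value-head regression loss is omitted.

```python
import torch

def a2c_policy_loss(logprobs: torch.Tensor,         # log pi_theta(A_t | X_t), shape [T]
                    values: torch.Tensor,            # baseline V_psi(X_t), shape [T]
                    terminal_reward: torch.Tensor,   # scalar RM score R_{T_last}
                    ) -> torch.Tensor:
    # Only the terminal reward is non-zero and gamma = 1, so the return at every
    # step equals the terminal reward; the advantage is R_{T_last} - V_psi(X_t).
    advantages = (terminal_reward - values).detach()  # no policy gradient through the baseline
    # L^{A2C} = sum_t log pi_theta(A_t | X_t) * advantage_t; negated so that a
    # gradient-descent optimizer maximizes the estimator.
    return -(logprobs * advantages).sum()
```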
"OpenAI" preamble A good summary is a shorter piece of text that has the
essence of the original. It tries to accomplish the same
purpose and conveys the key information from the original
post. Below we define four evaluation axes for summary
quality: coherence, accuracy, coverage, and overall quality.
Table 6: The "Base" and "OpenAI" preambles given to the LLM labeler to obtain preference labels.
Rationale:
Table 7: The template used for the "OpenAI + COT 0-shot" prompt, with some text removed for brevity. For COT
prompts, we first decode a response from the LLM and then concatenate it with the original prompt and the ending
"Preferred Summary=" before following the scoring procedure in Section 3.1 to obtain a preference distribution.
Preamble A good summary is a shorter piece of text that has the
essence of the original. ... Given a piece of text and
two of its possible summaries, explain which summary best
adheres to coherence, accuracy, coverage, and overall quality
as defined above.
Thoughts on Summary 1 -
Coherence - 7. Rationale: The summary is generally
understandable, though it could be written with better
grammar.
Accuracy - 9. Rationale: The summary doesn’t say things
that aren’t in the original text, and isn’t misleading.
Coverage - 6. Rationale: The summary covers most of the
important information in the post and conveys the gist of
the original text. However, it places more emphasis on "no
contact" and could have mentioned the smothering/neediness to
be more complete.
Overall Quality - 7. Rationale: The summary represents
the post fairly well with only minor areas where it could be
improved.
Thoughts on Summary 2 -
Coherence - 3. Rationale: The summary is long-winded and
has several grammatical errors.
Accuracy - 4. Rationale: The summary mentions that the
author broke no contact, but this is incorrect. Otherwise,
it is accurate.
Coverage - 8. Rationale: The summary covers the key points
in the original text.
Overall Quality - 4. Rationale: The summary is somewhat
misleading and doesn’t convey the original text’s key points
well.
Preferred Summary=1
Table 8: The template used for the "OpenAI + COT 1-shot" prompt, with some text removed for brevity.
Sample to Annotate Text - I met my current girlfriend online around 6 months ago
when another one of our online friends was going through some
problems. ...
Thoughts on Summary 2 -
Coherence - 9. Rationale: The summary is concise and easy
to understand.
Accuracy - 9. Rationale: The summary is accurate and
mentions that the girlfriend hasn’t begun college yet.
Coverage - 9. Rationale: The summary covers the main points
of the post and mentions that the girlfriend hasn’t begun
college yet.
Overall Quality - 9. Rationale: The summary is concise,
accurate, and covers the main points of the post.
Table 9: An example of the different chain-of-thought rationales produced by the 0-shot ("OpenAI + COT 0-shot")
vs. 1-shot ("OpenAI + COT 1-shot") prompts.
Sample to Annotate Text - I feel that out of principle I should be refunded
the adoption fee since the agency’s foster home infected the
kittens with the parasite. Both cats were born in the foster
home and there are 20 other cats. Do I have any legal right
to ask for the fee back? Or help with the cost of treating?
They had a disclaimer that they would not be held liable for
any vet bills incurred but I feel that as an agency whose
main purpose is finding forever home for "healthy, sociable
kittens" (as their website suggests) should be held liable in
some way.
Table 10: An example comparing chain-of-thought rationales produced at different temperatures for self-consistency
experiments.
Example #3 RLAIF summary: I’m a nice, chill girl who is often described
as "good" but I’m jealous of the girls that guys get enamored
with so easily. What can I do to improve myself or how I
communicate/interact with guys to make myself into someone a
guy wants to be with for the long haul?
Table 11: We observe that the RLHF policy tends to hallucinate more frequently than the RLAIF policy. Hallucina-
tions are highlighted in red.
Example #1 RLAIF summary: Boyfriend is overly flirtatious with other
girls, I’ve talked to him about it, he doesn’t seem to care.
It’s causing trust issues. Am I overreacting? What else can
I do?
Example #2 RLAIF summary: Asked a girl to prom, things were going great
until I asked her. Now our conversations are awkward and I’m
not sure if I should ask her out. Should I just give up?
Table 12: Another pattern identified through manual inspection is that summaries from the RLAIF policy tend to be less coherent and grammatical than summaries from the RLHF policy. Less coherent phrases are highlighted in red.