Decipher
ABSTRACT
State-of-the-art large language models (LLMs) have become indispensable tools for various tasks.
However, training LLMs to serve as effective assistants for humans requires careful consideration.
A promising approach is reinforcement learning from human feedback (RLHF), which leverages
human feedback to update the model in accordance with human preferences and mitigate issues like
toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial
design choices that popularized the method, and current research focuses on augmenting those choices
rather than fundamentally improving the framework. In this paper, we analyze RLHF through the
lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating
substantial focus to the core component of RLHF—the reward model. Our study investigates
modeling choices, caveats of function approximation, and their implications on RLHF training
algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our
analysis improves the understanding of the role of reward models and methods for their training,
concurrently revealing limitations of the current methodology. We characterize these limitations,
including incorrect generalization, model misspecification, and the sparsity of feedback, along with
their impact on the performance of a language model. The discussion and analysis are substantiated
by a categorical review of current literature, serving as a reference for researchers and practitioners to
understand the challenges of RLHF and build upon existing efforts.
[Figure 1 diagram: the RLHF pipeline, in which the model queries the reward model (RM) for reward. The stages are annotated with their challenges: feedback challenges (miscalibrated, noisy, and sparse feedback), reward model challenges (misgeneralization, averaged preferences, model class assumptions, preference aggregation), evaluation challenges (accuracy, calibration, out-of-distribution generalization, misalignment), and evaluation via inter-annotator agreement, expert agreement, downstream tasks, and user response.]
Figure 1: Overview of the RLHF procedure, illustrating the challenges encountered at each step. The paper conducts a
detailed examination of these challenges, providing valuable insights into each stage of the procedure.
1 Introduction
Large Language Models (LLMs) demonstrate remarkable capabilities that extend beyond basic language tasks, leading
to their widespread adoption across various industries. The remarkable utility of these models holds the potential to
transform established workflows in critical sectors such as technology, healthcare, finance, and education [Singhal et al.
2022; Wu et al. 2023a; Yan et al. 2023]. As they become integral to these domains, it is crucial to ensure that the behavior
of LLMs is predictable, safe, and trustworthy, meeting the expectations set for a human performing the same tasks.
This challenge of making LLMs exhibit human-like qualities, known as alignment with human objectives, is central to
making these models suitable for diverse tasks. An effective method for addressing this challenge is reinforcement
learning from human feedback (RLHF).
RLHF first gained popularity due to its ability to solve reinforcement learning (RL) problems like simulated robotic
locomotion and playing Atari games [Christiano et al. 2017] without access to a reward function, by simply leveraging
human feedback about preferences on demonstrated behaviors. It has since been adopted for fine-tuning LLMs using
human feedback. This leads to a natural inquiry: How can a method designed to master games be effectively used
to align LLMs with human objectives? The method has proven to be immensely successful [OpenAI 2022], but not
without well-documented limitations [Casper et al. 2023]. A comprehensive understanding of why it achieves its success
remains largely elusive. Consequently, research efforts on the topic are stuck in a local minimum, with variants focused
on augmenting the components of the method—including the training algorithm [Ramamurthy et al. 2022], reward
model [Wu et al. 2023c], and even RL-free approaches [Rafailov et al. 2023]. However, some fundamental limitations
of the approach remain obscured due to the overarching goal of recent work to refine the initial design choices.
In this work, we develop a comprehensive understanding of RLHF by analyzing the core components of the method. We
begin the study by motivating the necessity for RLHF by highlighting the problem of objective mismatch in pre-trained
LMs (Section 2). To formulate foundational questions about the framework, we adopt a Bayesian perspective of RLHF.
It serves to highlight the significance of the reward function in particular (Section 4). The reward function forms the
central cog of the RLHF procedure, and the design choices used to model it form a major focus of our study.
The current formulation of RLHF relies on a set of assumptions to model the reward function (Sections 4.1, 4.2).
After delineating these assumptions, we analyze the reward model independently of specific modeling choices.
The analysis, in a principled manner, provides an understanding of issues such as:
1. The impractical requirement for extensive amounts of feedback data for training accurate reward models.
2. The combination of very limited feedback data and the use of function approximation results in misgeneraliza-
tion, wherein inaccurate reward values are assigned to inputs not seen during training.
These imperfections of the reward model, along with challenges such as reward sparsity and reward model misspecifica-
tion, are highlighted in the paper (Section 5.1). Their impact on the performance of a language model is explored in
detail (Section 6.2). The course of the analysis leads to the formalization of concepts such as an oracular reward that
serves as the theoretical gold standard for future efforts (Section 4.1). An overview of the RLHF procedure along with
the various challenges studied in this work is provided in Figure 1.
The discussion is followed by an extensive survey of an expanding body of literature related to the topic. The survey is
organized into sections that outline the framework of RLHF. Starting with a high-level overview of Large Language
Models (LLMs), the survey systematically covers human feedback collection (Section 7.3), training of the initial policy (Section 7.4), reward model training (Section 7.5), RL-based training (Section 7.6), and the properties and limitations of RLHF-trained models (Section 7.7).
This structure aims to provide a comprehensive overview of the extensive landscape of works that have contributed to
the remarkable success of RLHF.
Large pre-trained language models (PLMs) are massive neural networks that are trained on a huge corpus of texts
using a self-supervised learning objective. Originally utilized for representation learning [Devlin et al. 2019; Liu et al.
2019] with encoder-only models, recent research, particularly influenced by Brown et al. [2020], has shifted its focus
towards training PLMs to directly generate answers for textual problems. State-of-the-art PLMs typically employ an
auto-regressive transformer architecture [Vaswani et al. 2017] and are trained with a causal language modeling objective.
These models implicitly capture a conditional probability distribution πθ , reflecting the likelihood of sampling the next
token after observing a sequence of previous tokens. The probability of a text sequence $x := (x_1, \ldots, x_T)$ under this
model is denoted as $\Pr(x; \pi_\theta) = \prod_{t=1}^{T-1} \pi_\theta(x_{t+1} \mid x_t, \ldots, x_1)$. The model is trained to estimate the pre-training data
generating probability distribution over text sequences by minimizing the (forward) KL divergence between the model’s
data-generating distribution and the pre-training data distribution, denoted by Ppre-train (·).
$$\min_\theta\, D_{\mathrm{KL}}\!\left(P_{\text{pre-train}}(x)\,\|\,\Pr(x;\pi_\theta)\right) = \min_\theta\, \mathbb{E}_{x\sim P_{\text{pre-train}}}\!\left[\log P_{\text{pre-train}}(x)\right] - \mathbb{E}_{x\sim P_{\text{pre-train}}}\!\left[\log \Pr(x;\pi_\theta)\right]. \tag{1}$$
The first term, representing the entropy of Ppre-train , is independent of θ and can be disregarded during optimization.
Consequently, the objective simplifies to the following cross-entropy minimization form:
$$\min_\theta\, -\mathbb{E}_{x\sim P_{\text{pre-train}}}\!\left[\log \Pr(x;\pi_\theta)\right]. \tag{2}$$
The expectation is approximated using samples from an unsupervised pretraining text corpus D, which comprises text
sequences sampled from Ppre-train . This leads us to the following objective:
$$\min_\theta\, -\frac{1}{|D|}\sum_{x\in D}\sum_{t=1}^{T-1}\log \pi_\theta(x_{t+1}\mid x_t,\ldots,x_1). \tag{3}$$
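To make the objective concrete, the following is a minimal PyTorch-style sketch of Equation 3 computed on a single batch; the model and token_ids names are hypothetical placeholders rather than components specified in this paper.

```python
import torch.nn.functional as F

def causal_lm_loss(model, token_ids):
    """Monte-Carlo estimate of the objective in Equation 3 on one batch.

    token_ids: LongTensor of shape (batch, T) holding token indices.
    model:     any callable mapping token_ids -> logits of shape (batch, T, |V|).
    """
    logits = model(token_ids)                 # a next-token distribution at every position
    pred = logits[:, :-1, :]                  # position t predicts token t+1
    target = token_ids[:, 1:]                 # the observed next tokens
    # Cross-entropy is the negative log-likelihood of the observed next tokens,
    # averaged over the batch and the T-1 prediction positions.
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```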
The remarkable property of PLMs lies in the contrast between the simplicity of the training recipe and the quality of the
results they deliver [Brown et al. 2020]. Simply capturing language statistics, combined with scaling up the number of
trainable parameters, endows PLMs with robust semantic representations, vast commonsense knowledge, and strong
pattern-following capabilities. However, for adapting PLMs to assist humans with tasks that require an understanding
of human intentions and the ability to follow instructions, the simple training recipe of PLMs is insufficient. These
models demonstrate a shallow understanding of human intentions, often generating undesirable outputs, including
incorrect facts or conveying biased and toxic opinions.
Fundamentally, PLMs suffer from an objective mismatch problem: the training-time objective of capturing language
statistics does not necessarily align with the deployment-time objective of fulfilling a human user’s specific goals.
Eliminating this mismatch at first glance seems feasible: just train PLMs to optimize for the user objective. Unfortunately,
for many tasks, it is impossible to express the user objective as an optimization target. For example, when a user’s
objective pertains to eliciting humorous responses, establishing specific criteria for objectively evaluating the humor in
a generated response becomes an inherently challenging task.
There are currently two primary ways to deal with the problem: the behaviorist approach and the cognition-driven
approach. The behaviorist approach, implemented by supervised fine-tuning (SFT), aims to replicate observable
behaviors that humans perceive as desirable without explicit consideration of the underlying user objective. For instance,
if a user desires good summaries of articles, this approach trains a model to imitate examples of good summaries
without explicitly defining the criteria for a good summary. In contrast, the cognition-driven approach, implemented
by reinforcement learning from human feedback (RLHF), aims to uncover the underlying user objective that governs
the observed behaviors. It then updates the model by optimizing the uncovered objective. This approach relies on
certain assumptions—which in the case of RLHF are: (i) the user objective can bear the form of a reward function,
which can assign a numerical score to behaviors of the model, and (ii) this function can be approximated by a machine
learning model (e.g., a neural network). RLHF estimates this reward function and updates the PLM via reinforcement
learning to optimize for rewards. Regardless of the approach, the process of addressing the objective mismatch problem
is commonly referred to as the fine-tuning or alignment process. Presently, state-of-the-art language models typically
initiate this process with the behaviorist approach, followed by the cognition-driven approach.
RLHF relies on observing human feedback to deduce the (latent) user reward function. Human feedback is provided
on the outputs from a language model. RLHF assumes that there exists an underlying human reward function that
governs the feedback they provide in a particular manner, i.e., there exists some mapping from reward to actions of
a human. Suppose the reward function is being inferred by a model Rϕ parameterized by ϕ. Adopting a Bayesian
inference perspective [Korbak et al. 2022], the parameters ϕ can be viewed as a hypothesis with the dataset of human
feedback DHF as the evidence for this hypothesis. Given a prior distribution over the hypothesis Pr(ϕ), we can apply
Bayes’ rule to derive the posterior distribution over the hypotheses after observing the evidence as:
Pr(ϕ | DHF ) ∝ Pr(DHF | Rϕ ) Pr(ϕ) (4)
Reward modeling in RLHF can be seen as computing the maximum a posteriori (MAP) estimate of the parameters of a
reward model,
$$\phi_{\mathrm{MAP}} = \arg\max_{\phi}\, \Pr(\phi \mid D_{HF}) = \arg\max_{\phi}\, \underbrace{\Pr(D_{HF} \mid R_\phi)}_{(a)}\, \underbrace{\Pr(\phi)}_{(b)} \tag{5}$$
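Since the logarithm is monotone, the same MAP estimate can equivalently be written in log space, which makes explicit that term (a) enters as a log-likelihood and term (b) as a regularizer on the hypothesis space:

$$\phi_{\mathrm{MAP}} = \arg\max_{\phi}\,\Big[\,\underbrace{\log \Pr(D_{HF} \mid R_\phi)}_{\text{(a) log-likelihood}} + \underbrace{\log \Pr(\phi)}_{\text{(b) regularization}}\,\Big]$$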
The first term (a) is the log-likelihood of the feedback dataset, specifying how a human’s internal objective (reward
function) governs their feedback. The second term (b) represents constraints on the hypothesis space, which is enforced
through explicit and implicit regularization techniques in neural-network training.
The presented framework raises two major questions:
1. What is the form of the likelihood function Pr(DHF | Rϕ )? In other words, how do we mathematically model
the influence of a human’s latent objective on their observable feedback?
2. What reinforcement learning algorithm is used to optimize the language model against the reward? In other words, how do
we ensure the model acts consistently with its objective?
A set of answers to these questions forms the basis for an RLHF algorithm. The RLHF methodology, popularized by
Christiano et al. [2017], employs pairwise ranking feedback and uses the Bradley-Terry model [Bradley and Terry 1952]
as the likelihood function. Proximal Policy Optimization (PPO) [Schulman et al. 2017] is selected as the reinforcement
learning algorithm.
Before we move into the analysis of this method, we urge the readers to take a moment to reflect on the choices
and assumptions we have made so far to derive the general recipe of RLHF. Are there alternative choices? Can the
assumptions be relaxed or improved? Thinking critically about these foundational decisions is the key to understanding
the strengths and weaknesses of RLHF algorithms and innovating them. For example, the recently proposed direct
preference optimization (DPO) approach [Rafailov et al. 2023] replaces reinforcement learning with a reformulation of
the objective. Next, we formalize the problem setup of text generation as an agent interacting with a sequential decision
process, laying the foundation for the analysis of RLHF. We refer the reader to Section 7.6 for a detailed outline of the
RLHF procedure, and Figure 5 for a summarized overview.
Markov decision process. A common framework for modeling sequential decision-making processes is Markov
Decision Process (MDP) [Markov 1954]. An MDP is defined as a tuple (S, A, p, R, ρ) where S is the set of states,
A is the set of actions, p : S × A → ∆(S) is the transition function, R : S × A → R is the reward function,
and ρ ∈ ∆(S) is the initial state distribution. Each sequential time step of the process is denoted by t, and
[Figure 2 diagram: at each timestep, the policy πθ takes the current state (the context c concatenated with previously generated tokens o1, o2, ...) and produces the next token as its action, receiving a reward ri; the legend marks state, action, transition, reward, and policy.]
Figure 2: Text generation from LLMs modeled as a Markov decision process. The generation process is auto-regressive,
utilizing the token output (action) from the previous time step and the context (state) as input to produce the next token
through the language model (policy). Given a context c, the language model produces the token o1 at the first timestep.
A concatenation of the two [c, o1 ] forms the input to the policy at the next timestep (Table 1). A reward function scores
the generated output for a given context.
st , at , rt denote the values of the state, action, and reward at time step t. A discounting factor γ ∈ (0, 1] is defined for
discounting rewards over time, particularly useful for modeling an MDP with an infinite number of time steps (i.e., an
infinite-horizon MDP). However, the outputs of language models are truncated after a finite number of steps. We use T
to denote the maximum time step.
An agent acts in an MDP using a policy π : S → ∆(A). The agent starts in state s1 ∼ ρ(·). At time step t, it chooses
an action at ∼ π( · | st ), executes the action, transitions to a new state st+1 ∼ p( · | st , at ), and receives a reward
rt = R(st , at ). The term “Markov” in MDP refers to the Markov property, in that the distribution over the next state
st+1 depends on only the current state st and action at .
Language models as agents in MDP. For simplicity, we consider text generation tasks that include only one turn of
interaction between the user and the model. We make a distinction between the text that a user inputs into the model,
denoted by c and referred to as the context or the prompt, and the text that the model generates in response to the context,
denoted by o and referred to as the output or simply the generated text.
Let V be the set of all tokens that the model can generate (the vocabulary), C the set of all possible contexts, and
O the set of all possible outputs. Given a context c ∈ C as input, the model generates an output o ∈ O token by
token. Specifically, let ot be the t-th token in generated output o, then the model parameterized by θ first outputs token
o1 ∼ πθ ( · | c), and then conditioned on the concatenation of o1 and c it generates o2 ∼ πθ ( · | [c, o1 ]), and so on.
We can see that this generation process resembles an agent traversing in an MDP (Figure 2). The model acts according
to a policy πθ . The start-state distribution ρ is the distribution over user-provided contexts. The action space is the
vocabulary V . The action at is the generated token ot . The state st is the concatenation of the context c and all the
tokens the model has generated up to time step t − 1. The transition function p( · | st , at ) = δ([st , at ]) is a delta
distribution, i.e., the next state is deterministic given the current state and action. The reward rt at time step t is
computed as Rϕ(st, at), where Rϕ is either a human or a reward model learned from human feedback.
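The mapping can be phrased as a short rollout loop. The sketch below is illustrative only: sample_next_token and reward_model are hypothetical stand-ins for the policy πθ and the learned Rϕ, and, as is typical in current setups, the reward is issued once at the end of the generation.

```python
def rollout(sample_next_token, reward_model, context_tokens, eos_token, max_steps):
    """Text generation viewed as an MDP episode: the state is the context plus the
    tokens generated so far, the action is the next token, and the transition is a
    deterministic concatenation."""
    state = list(context_tokens)              # s_1 = c, drawn from the prompt distribution rho
    actions, rewards = [], []
    for t in range(max_steps):
        action = sample_next_token(state)     # a_t ~ pi_theta(. | s_t)
        actions.append(action)
        state = state + [action]              # s_{t+1} = [s_t, a_t]  (delta transition)
        done = (action == eos_token) or (t == max_steps - 1)
        # The reward model scores only complete outputs, so intermediate rewards
        # are zero and a single scalar reward arrives at the end of the episode.
        rewards.append(reward_model(state) if done else 0.0)
        if done:
            break
    return state, actions, rewards
```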
The text generation MDP has several special properties:
1. The action space is extremely large. For example, the LLaMa model [Touvron et al. 2023a,b] employs a
vocabulary of size 32K. Having a gigantic action space blows up the search space for reinforcement learning
algorithms.
2. The structure of the state space is complex, as a state is essentially a text sequence. Pre-training on large
amounts of texts is necessary to learn an initially good representation of this space.
MDP element Description Text generation equivalence
s1 Initial state c
st State at time step t [c, o1:t−1 ] = [s1 , a1:t−1 ] = [st−1 , at−1 ]
at Action taken at time step t ot
π(at | st ) Policy πθ (ot | [c, o1:t−1 ])
rt Reward at time step t Rϕ ([c, o1:t−1 ])
ρ(s1 ) Initial state distribution Pr(c)
p(st+1 | st , at ) Transition function δ([st , at ])
Table 1: Mapping from text generation to MDP
3. The initial state distribution has an enormous support. All conceivable contexts lie in the support, thus strongly
testing the ability of the policy to generalize to out-of-distribution states.
4. The reward function used for training can differ from the evaluation reward function. This is because the
humans providing rewards during evaluation may be different from the humans involved in training the reward
model. Analogous to transfer learning in RL, the agent must then adapt to the new reward function.
5. The transition function is deterministic. Algorithmic and analysis tools tailored for deterministic MDPs can be
applied.
Thus, solving a text generation MDP requires specialized treatment that takes advantage of its properties and overcomes
its inherent challenges. Reinforcement learning [Sutton and Barto 2018; Bertsekas and Tsitsiklis 1996] provides
solutions for optimally solving an MDP, i.e., learning a policy that maximizes the accumulated reward. Consequently,
RLHF updates the language model to generate more rewarding outputs. Naturally, the reward function plays a critical
role in the process of fine-tuning model outputs, determining practical and fundamental limits [Casper et al. 2023] of
the efficacy of RLHF.
An implicit assumption made in RLHF is that a human’s feedback behavior is governed by and can be represented as
an oracular reward function R⋆ : C × O → R. We assume that this function is deterministic in line with the current
methodology. The function takes as input a context c and an output o, and outputs a scalar number reflecting the
preference on o as a continuation of c. Because [c, o] is essentially a state in the MDP formulation, the reward function
is essentially defined over states of the MDP. The language model that maximizes the oracular reward accurately
reflects the goals and preferences inherent in the human feedback, and maximization of this reward consequently aligns
the model with the human preferences. The oracular reward may not be accessible or learnable, but under the reward
hypothesis [Sutton 2004; Silver et al. 2021], the mere existence of such a reward may be assumed—though this may be
challenged [Knox et al. 2022]. The oracular reward forms the gold standard for training as well as evaluating any
language model.
In general, humans can give a variety of feedback. RLHF operates with feedback that discloses information about the
oracular reward function. Most methods focus on two types of feedback: point-wise numerical feedback (or rating),
and pairwise ranking feedback (or preferences). Providing ratings is the most straightforward way to communicate
the reward function. Given a pair (c, o), the rating is a scalar r = R⋆ (c, o). While ratings can be fed directly into a
¹Unless the task is specified in the input prompt itself, in which case the inputs differ.
reinforcement learning algorithm, learning a reward model takes advantage of the generalizability of the reward model
on unseen outputs and contexts.
Preference feedback compares two outputs generated for the same context. Given two outputs o and o′ generated for
context c, a human denotes a preference o ≻ o′ if the first output is preferred and o′ ≻ o otherwise. Preferences in their
raw form are not directly compatible learning signals for reinforcement learning algorithms. Hence, a reward model must be
learned for this type of feedback. To do so, an assumption must be made about the relationship between preferences
and R⋆ . We will discuss this in more detail in the next section. A discussion about the various methodologies used
for encoding preferences can be found in Section 7.5. An alternative approach for ranking outputs on the basis of
preferences is provided by the learning-to-rank paradigm [Liu et al. 2009].
Using preference feedback offers several advantages compared to using ratings. Firstly, we get more training data for
the reward model. In practice, people collect a ranking of N outputs and create preference pairs [Ouyang et al. 2022].
Collecting N ratings for N outputs provides N training points, whereas ranking N outputs provides N(N − 1)/2
pairwise comparisons. Second, preferences require assigning only a relative order rather than a precise absolute score
to an output; the latter task could take significantly more cognitive effort and is more prone to inconsistency. Finally, a
preference is presumably easier to provide because it offers a “baseline” for comparison (the worse output). In contrast,
when giving a rating, a human can rely on only the evaluation guidelines.
A note on stochastic rewards: The reward function is considered to be a deterministic mapping from text to a scalar
value. This amounts to averaging the preferences of all humans that provided human feedback. Moreover, it assumes
that a human must always rate an input-output pair with the same score, discounting the inherent variability of human
preferences. There are numerous scenarios—like personalization, in-context adaptation to ongoing dialogue, and
diverse output generation—where a deterministic mapping is limiting. The rewards are more appropriately modeled as
being stochastic, wherein each input-output pair is scored by a distribution over scalar rewards, say r ∼ Rhuman (· | c, o).
This modeling accounts for the two sources of uncertainty: (i) uncertainty over the specific human from a group of
humans who provide feedback, and (ii) variability in a human’s preferences due to changes in unobserved factors
[Nguyen et al. 2017]. Some work in reinforcement learning aims to address this by learning Bayesian preferences,
primarily for uncertainty quantification and safety analysis [Ramachandran and Amir 2007; Brown and Niekum 2019],
and can be adapted to model a distribution of preferences over text. Some recent efforts along these lines [Barnett et al.
2023] have proven to be effective. We focus on deterministic rewards for the analysis that follows.
Learning a reward model serves two purposes: (i) to convert RLHF into a canonical reinforcement learning problem,
and (ii) to reduce the cost of online feedback-collection. Reinforcement learning algorithms define their objective in
terms of a reward function. To apply these algorithms, we need to infer a reward function from a feedback dataset,
collecting which is notoriously expensive. Currently, large language models require thousands to millions of feedback
data points. To gather that amount, many human evaluators need to be recruited to work in parallel. To ensure the
assumptions regarding the oracular reward function hold, the evaluators must be trained to agree with one another
on the evaluation criteria. This process is continual: multiple rounds of feedback collections need to be conducted to
iteratively improve the model. The premise of approaches that learn a reward model is that the generalization error of
the reward model is expected to decrease faster than that of the policy as a function of the number of labeled data points,
arising from the notion that supervised learning is often considered a simpler problem than generative modeling.
Following the previous section, we denote the reward model by Rϕ (c, o) and the feedback dataset by DHF . Our goal is
to decide a likelihood function Pr(DHF | ϕ) and find ϕ that maximizes this function:
$$\max_\phi\, \Pr(D_{HF} \mid \phi) \tag{6}$$
With rating feedback, the reward-modeling problem can be formulated as a prediction problem with continuous output.
A common objective for this type of problem is the minimization of the mean squared error (MSE):
$$\min_\phi \sum_{(c,o,r)\in D_{HF}} \left(R_\phi(c,o) - r\right)^2 \tag{7}$$
To incorporate preference feedback, we need to choose the form of the likelihood function denoting each preference,
i.e., Pr((o ≻ o′ , c) | ϕ). The RLHF method of Ouyang et al. [2022] employs the Bradley-Terry model to represent the
likelihood of a data point:
$$\Pr((o \succ o', c) \mid \phi) = \frac{\exp(R_\phi(c,o))}{\exp(R_\phi(c,o)) + \exp(R_\phi(c,o'))} \tag{8}$$
$$= \frac{1}{1 + \exp(R_\phi(c,o') - R_\phi(c,o))} \tag{9}$$
$$= \sigma\!\left[R_\phi(c,o) - R_\phi(c,o')\right] \tag{10}$$
where $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function. The learning objective for maximizing the log-likelihood of the dataset
DHF is,
$$\max_\phi \sum_{(c,o,o')\in D_{HF}} \log \Pr((o \succ o', c) \mid \phi) = \max_\phi \sum_{(c,o,o')\in D_{HF}} \log \sigma\!\left[R_\phi(c,o) - R_\phi(c,o')\right]. \tag{11}$$
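For concreteness, the following is a minimal PyTorch-style sketch of the objective in Equation 11, negated so that minimizing the loss maximizes the log-likelihood; reward_model is an assumed placeholder that maps a batch of (context, output) pairs to scalar scores, not a prescribed architecture. Rating feedback (Equation 7) would instead use a plain mean-squared-error loss on the same scores.

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_model, contexts, preferred, rejected):
    """Negative log-likelihood of pairwise preferences under the Bradley-Terry model.

    reward_model(contexts, outputs) is assumed to return a tensor of shape (batch,)
    containing one scalar reward R_phi(c, o) per context-output pair.
    """
    r_preferred = reward_model(contexts, preferred)    # R_phi(c, o)
    r_rejected = reward_model(contexts, rejected)      # R_phi(c, o')
    # -log sigma(R_phi(c, o) - R_phi(c, o')) for each preference pair
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```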
In Section 5, we further generalize the form of feedback and the likelihood function to conduct an analysis independent
of the specifics of particular design choices.
Evaluation of natural language tasks is a difficult problem, and the study of evaluation metrics is an active area of
research. Of particular importance, and difficulty, is to measure the alignment of a language model to a human’s
objectives, which in practice is evaluated along the axes of helpfulness, harmlessness, and honesty. The oracular reward
that governs a human’s preferences serves as a yardstick for measuring the degree of alignment. The task of alignment
is then reformulated as encoding the preferences demonstrated by a human into a reward function, and updating the
parameters of the language model to produce output that maximizes this reward.
A reward provides an analytical metric to measure the overall performance of a language model π, where the performance
captures the degree of alignment with human preferences along with the degree of satisfaction of the task itself [Ngo
et al. 2022]. The performance of a model π, for distribution over contexts dC (·), can be measured by averaging the
rewards for the outputs generated by π given the contexts. Let the performance be denoted by J(π):
$$J(\pi) := \sum_{c}\sum_{o} d_C(c)\,\pi(o \mid c)\,R^\star(c,o) = \mathbb{E}_{c\sim d_C(\cdot)}\,\mathbb{E}_{O\sim\pi(\cdot\mid c)}\!\left[R^\star(c,O) \mid C = c\right] \tag{12}$$
The context distribution dC (·) can be the distribution of contexts in the training data, test data, or a held-out validation
dataset, depending on the data on which the performance of the model is being evaluated. The sequential nature of the
output generation equivalently allows us to express J(π) as:
$$J(\pi) := \mathbb{E}_\pi\!\left[\sum_t R^\star(s_t, a_t)\right] = \mathbb{E}_\pi\!\left[\sum_t R^\star(c, o_{1:t-1}, o_t)\right] = \sum_t \mathbb{E}_\pi\!\left[R^\star(c, o_{1:t-1}, o_t)\right] \tag{13}$$
In practice, most current reward models only provide a reward after the complete output has been generated and
Equation 13 reduces to Equation 12. The definition of J(π) uses the oracular reward that is not accessible in practice.
An estimate of the performance can be obtained from the estimated reward Rϕ, by plugging it into Equation 12:
$$\hat{J}(\pi) := \mathbb{E}_{c\sim d_C(\cdot)}\,\mathbb{E}_{O\sim\pi(\cdot\mid c)}\!\left[R_\phi(c,O) \mid C = c\right] \tag{14}$$
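In practice, Equation 14 is approximated by Monte-Carlo sampling. The sketch below assumes hypothetical sample_context, policy, and reward_model callables and simply averages reward-model scores over sampled context-output pairs.

```python
def estimate_performance(sample_context, policy, reward_model, num_samples=1000):
    """Monte-Carlo estimate of J_hat(pi) from Equation 14: average the reward-model
    score of outputs sampled from the policy on sampled contexts."""
    total = 0.0
    for _ in range(num_samples):
        c = sample_context()          # c ~ d_C(.)
        o = policy(c)                 # o ~ pi(. | c)
        total += reward_model(c, o)   # R_phi(c, o)
    return total / num_samples
```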
The pre-trained language model is denoted by πpre : C → ∆(O) and the model updated using RLHF by πrlhf : C → ∆(O). The goal of RLHF is to update the parameters of πrlhf such that J(πrlhf) ≥ J(πpre), i.e., as evaluated using the
oracular reward. In practice, it is only possible to verify that $\hat{J}(\pi_{\text{rlhf}}) \geq \hat{J}(\pi_{\text{pre}})$, which may be non-informative when
the estimated reward model Rϕ has inaccuracies for the context-output pairs being evaluated (Section 6.2).
The generality of this formulation allows the following analysis to cover all existing RLHF-style approaches (for
example, RLAIF [Bai et al. 2022b]) as well as future methods for fine-tuning LLMs that employ a reward model.
Let D := {(c, o) : c ∈ C, o ∈ O} denote a hypothetical dataset of all possible contexts and outputs that a language
model can encounter, i.e., a humongous dataset of size $|C| \times |O| = |V|^T$. This dataset cannot be realized in practice and
is invoked to shed light on the practical limitations of the existing methodology. Denote the dataset of collected human
feedback by DHF := {(c, o, feedback) : c ∈ CHF , o ∈ OHF } where CHF ⊂ C, OHF ⊂ O are the subsets of context-
output pairs (human-)annotated with feedback.2 The reward encoding mechanism that maps context-output pairs
along human feedback to rewards (for instance, the Bradley-Terry model) is denoted by Ω : (c, o, feedback) → R.3
To uncover R⋆ , it is assumed that Ω accurately maps back human feedback to the oracular reward, i.e., for sufficiently
informative feedback, we have
Ω(c, o, feedback) = R⋆ (c, o), ∀ c, o ∈ CHF , OHF .
Under that assumption, Ω can operate on DHF to create a dataset of context-output-reward tuples, Drew = {(c, o, r) :
c ∈ CHF , o ∈ OHF } where r = R⋆ (c, o). With Drew , learning the reward model Rϕ reduces to a regression problem
employing a function approximator. The regression problem is however underdetermined [Bishop 2006], and conse-
quently multiple Rϕ functions can perfectly fit the training data Drew . However, almost all of these functions fail to
accurately represent the oracular reward (Figure 3). Due to the cost of human annotation, practically human feedback
can be collected on a very small subset of context and output pairs, i.e., CHF , OHF ⊂ C, O. The size of the reward and
feedback datasets relative to the hypothetical dataset of all possible inputs and outputs D can be measured by:
1. Context coverage: $\kappa := \frac{|C_{HF}|}{|C|}$
2. Output coverage: $\rho := \frac{|O_{HF}|}{|O'|}$, where $O' = \{o : (c, o) \in D,\ \forall c \in C_{HF}\}$
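A rough calculation illustrates how small these ratios are in practice. Assuming, purely for illustration, a 32K-token vocabulary (as in LLaMA), outputs of length T = 100, and an optimistic one million annotated outputs per context, the output coverage ρ is effectively zero:

```python
import math

# Back-of-the-envelope output coverage under illustrative assumptions.
vocab_size = 32_000        # |V|, e.g., the LLaMA vocabulary
T = 100                    # output length in tokens (illustrative)
annotated_outputs = 10**6  # |O_HF| per context (an optimistic assumption)

log10_output_space = T * math.log10(vocab_size)              # log10 |O'| ~ log10 |V|^T
log10_rho = math.log10(annotated_outputs) - log10_output_space
print(f"|O'| ~ 10^{log10_output_space:.0f}, rho ~ 10^{log10_rho:.0f}")
# For these numbers, |O'| is on the order of 10^451 and rho on the order of 10^-445,
# so the reward model observes a vanishing fraction of possible outputs.
```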
Well-understood results in supervised learning suggest that the ratios ρ and κ, along with the generalization capabilities
of the function approximator [Nakkiran et al. 2019; Schaeffer et al. 2023; Bailly et al. 2022], determine the generalization
performance of the reward model for (c, o) ∈ C × O. In practice, the values of ρ and κ are extremely small, and consequently
the reward model often generalizes incorrectly on unseen (out-of-distribution) context-output pairs, assigning incorrect
rewards to such inputs. In the following sections, we study the practical limitations of estimating reward models.
The reward model, parameterized as Rϕ : C × O → R, is trained on Drew using a sufficiently representative function
approximator to perfectly fit the training data, that is, Rϕ(c, o) = R⋆(c, o), ∀ (c, o) ∈ Drew. The limitations of the
resultant reward model may be studied under the following categories:
Misgeneralization: Human feedback is obtained on a very small subset of all possible context-output pairs. This
partial coverage over contexts and outputs in Drew combined with the use of function approximators for learning the
reward model results in the reward model Rϕ (c, o) incorrectly generalizing to data points that are out-of-distribution
relative to Drew. We have assumed a sufficiently representative function approximator that perfectly fits the training
data, $\mathbb{E}_{(c,o)\sim D_{\text{rew}}}\!\left[(R^\star(c,o) - R_\phi(c,o))^2\right] = 0$. However, it cannot be ensured that $\mathbb{E}_{(c,o)\notin D_{\text{rew}}}\!\left[(R^\star(c,o) - R_\phi(c,o))^2\right]$
will be zero. That would require the function approximator to generalize perfectly outside the training data distribution,
which is not generally attainable, especially when the ratios ρ, κ are minuscule.
The benefits of reinforcement learning algorithms over other methods for finetuning are contingent on access to an
accurate reward function (Section 6.3), necessitating accurate out-of-distribution generalization of the reward model.
The inaccurate extrapolation out-of-distribution results in an ‘imperfect’ reward model that provides feedback on
context-output pairs in a manner that, when optimized, arbitrarily misaligns with human feedback (and the resultant
preferences) for those context-output pairs. The output distribution of πrlhf trained on this inaccurate feedback
can only be as good (or bad) as the reward signal provided by the reward model. This inaccurate generalization in
the reward model is one of the primary causes of phenomena like ‘reward hacking’ and ‘hallucinations’ [Kalai and
Vempala 2023], observed in practice.
Delayed feedback and Reward Sparsity: Reinforcement learning algorithms benefit from dense rewards as they
serve to quickly guide the agent to rewarding states, providing informative feedback to intermediate actions along the
²The subsets are significantly smaller than C and O. Additionally, the feedback can be of any form: ratings, pair-wise feedback, or language feedback (Section 7.3).
³feedback is overloaded to capture additional mechanism-specific meta-data. For instance, for pair-wise preference, feedback can store the preference relation and the (c, o) pair compared against.
[Figure 3 plot: the oracular reward R⋆(c, ·) and the estimated reward Rϕ(c, ·) as functions of the output o, with the horizontal axis partitioned into regions where o ∈ Drew and o ∉ Drew.]
Figure 3: The reward model tends to misgeneralize for inputs not found in its training data, i.e., for (c, o) ∉ Drew.
This occurs in two ways: 1) when the context is not sampled by the prompting distribution for generating output and
receiving feedback on (represented by κ), and 2) when the support of the output generating distribution—the language
model—for a context does not span all possible outputs (represented by ρ). The latter is depicted in this figure.
trajectory. In RLHF, the feedback from human annotators is obtained for complete output generations. Consequently,
the reward model is trained to provide reward feedback only at the end of the generated output for a given context.
This delayed feedback increases the difficulty of optimization with RL algorithms, increasing their sample complexity.
Sparse feedback is a constraint inherent to dealing with text and language [Sokolov et al. 2016], as it is often unlikely
for a human to provide feedback on incomplete sentences. Methods in RL developed to deal with sparse feedback, for
instance by stitching together information from partial trajectories [Andrychowicz et al. 2017], cannot be applied directly
to textual output due to the semantic constraints of dealing with partial sentences. Denser rewards and corresponding
feedback result in faster training, improved sample efficiency [Wu et al. 2023d], and potentially better generalization.
Insights from linguistics may be employed to obtain feedback on partial output generations and in turn denser rewards.
Marginalization over preferences: The reward model averages over the preferences of all human annotators (and
other sources of feedback) to output a deterministic scalar reward for a given context-output pair. The expectation is
that averaging over the preferences of multiple sources would be representative of the preferences of an average human
persona [Deshpande et al. 2023b]. This results in rewards that are inconsistent with any single human’s preferences.
Such preferences are more appropriately denoted by a distribution of rewards for a context-output pair. A deterministic
model, in addition to discounting the uncertainty and variability of human preferences, cannot model such a distribution,
highlighting a case of model misspecification.
The reward model forms the core component of RLHF and dictates the performance of a language model. The
aforementioned shortcomings of the reward model highlight the need for safety measures that must be employed while
using a language model fine-tuned using RLHF.
Policy gradient algorithms update the parameters of an agent’s policy using reward feedback. Being gradient-based
algorithms, their update rule is of the form:
θ ←− θ + α∇θ J(πθ ) (15)
where J(πθ ) is the performance (Equation (12)) of the policy parameterized by θ. The gradient of the performance of a
policy ∇θ J(πθ ) can be estimated from samples in numerous ways, each affording varying degrees of variance and
estimation error. In sparse rewards settings, the gradient estimation variance is a common issue that baselines [Mei
et al. 2022] help address. A class of methods called actor-critic methods update the policy by leveraging estimated
value functions, called critics, to reduce gradient estimation variance. The algorithm used for training most current
state-of-the-art large language models, Proximal Policy Optimization (PPO) [Schulman et al. 2017], is an actor-critic
algorithm with improvements over vanilla actor-critic to ensure stability during training. The improvements restrict
parameter updates at each iteration to prevent the policy distribution from drastically changing. The training loss
objective for PPO (PPO-Clip) takes the form:
$$L^{\text{ppo-clip}}(\theta) = \mathbb{E}\!\left[\min\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}(s_t, a_t),\ \operatorname{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}(s_t, a_t)\right)\right] \tag{16}$$
where Â(st , at ) is the estimate of the advantage function A(st , at ) := Q(st , at ) − V (st ) that captures the advantage
obtained in terms of cumulative reward by taking an action at from state st and then following the current policy,
relative to following the policy starting from state st . While this background suffices for the discussion in this paper, we
urge the reader to refer to Weng [2018] for a more in-depth explanation of the topic.
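A minimal sketch of the clipped objective in Equation 16 is shown below; the log-probabilities and advantage estimates are assumed to be precomputed tensors, and the sign convention follows the usual practice of minimizing the negated objective.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective of Equation 16 (negated for minimization).

    logp_new:   log pi_theta(a_t | s_t) under the current policy parameters
    logp_old:   log pi_theta_old(a_t | s_t) under the policy that collected the data
    advantages: estimates A_hat(s_t, a_t), e.g., from a learned critic
    """
    ratio = torch.exp(logp_new - logp_old)                      # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the pessimistic (element-wise minimum) of the unclipped and clipped surrogates.
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```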
In practice, a KL penalty DKL(πθ || πpre) with some weight β is added to the PPO training objective. This can be
interpreted either as a regularizer or as a prior that helps prevent over-optimization of an imperfect reward model. Using
a reward model Rϕ, the policy at convergence learnt by training with the updated PPO objective can be expressed directly
as a function of the reward [Scheurer et al. 2023; Rafailov et al. 2023] as,
$$\pi_{\text{rlhf}}(o \mid c) \propto \pi_{\text{pre}}(o \mid c)\,\exp\!\left(\frac{1}{\beta} R_\phi(c, o)\right) \tag{17}$$
where β is the weight on the KL penalty. Let Crew ⊂ C be the set of contexts in Drew .4 After training, πrlhf must generate
desirable (most rewarding) outputs when prompted with c ∈ Crew . But for out-of-distribution contexts, where the reward
estimation may be erroneous, the output distribution of πrlhf may be arbitrarily misaligned with human preferences and
generate undesirable output. This misalignment can be quantified by comparing against the policy trained with the
oracular reward. The set of contexts on which the performance of πrlhf is evaluated is denoted by Ceval with dCeval being
the distribution over those contexts. Let C′ := Ceval \ Crew be the set of contexts in the evaluation set that are not present
in Crew. The performance of πrlhf is given by:
$$J(\pi_{\text{rlhf}}) = \mathbb{E}_{c\sim d_{C_{\text{eval}}}(\cdot),\, o\sim \pi_{\text{rlhf}}(\cdot\mid c)}\!\left[R^\star(c,o)\right] \overset{(a),(b)}{\propto} \sum_{c\in C_{\text{rew}},\, o\in O} d_{C_{\text{eval}}}(c)\,\pi_{\text{pre}}(o \mid c)\exp\!\left(\tfrac{1}{\beta}R_\phi(c,o)\right) R^\star(c,o) + \sum_{c\in C',\, o\in O} d_{C_{\text{eval}}}(c)\,\pi_{\text{pre}}(o \mid c)\exp\!\left(\tfrac{1}{\beta}R_\phi(c,o)\right) R^\star(c,o)$$
where (a) is permitted by the following: ∀ o, c : πrlhf(o | c) > 0, πpre(o | c) > 0, and (b) follows from Equation (17). Let π∗rlhf
be the policy trained with RLHF using the oracular reward. It can be expressed as:
$$\pi^*_{\text{rlhf}}(o \mid c) \propto \pi_{\text{pre}}(o \mid c)\,\exp\!\left(\frac{1}{\beta} R^\star(c, o)\right)$$
⁴Note that Crew is the same as CHF. The subscript is used for clarity under the current context.
The performance of π∗rlhf can be written as:
$$J(\pi^*_{\text{rlhf}}) = \mathbb{E}_{c\sim d_{C_{\text{eval}}}(\cdot),\, o\sim \pi^*_{\text{rlhf}}(\cdot\mid c)}\!\left[R^\star(c,o)\right]$$
$$\propto \sum_{c\in C_{\text{rew}},\, o\in O} d_{C_{\text{eval}}}(c)\,\pi_{\text{pre}}(o \mid c)\exp\!\left(\frac{1}{\beta} R^\star(c,o)\right) R^\star(c,o) \;+\; \underbrace{\sum_{c\in C',\, o\in O} d_{C_{\text{eval}}}(c)\,\pi_{\text{pre}}(o \mid c)\exp\!\left(\frac{1}{\beta} R^\star(c,o)\right) R^\star(c,o)}_{\text{out-of-distribution}}$$
The performance gap $\Delta J := |J(\pi^*_{\text{rlhf}}) - J(\pi_{\text{rlhf}})|$ caused by the imperfections in the reward model can be quantified as,
$$\Delta J \propto \sum_{c\in C',\, o\in O} d_{C_{\text{eval}}}(c)\,\pi_{\text{pre}}(o \mid c)\,\left|\exp\!\left(\frac{1}{\beta} R^\star(c,o)\right) - \exp\!\left(\frac{1}{\beta} R_\phi(c,o)\right)\right| R^\star(c,o)$$
For out-of-distribution contexts and outputs, the reward model is known to misgeneralize. The performance gap
increases with increasing discrepancy from the oracular reward, and the discrepancy is further weighted by the
likelihood of that (c, o) pair and its oracular reward value. Some observations from the above analysis:
• πrlhf assigns high probability to highly rewarding outputs (Equation (17)), which is beneficial for in-distribution
contexts but can be harmful for out-of-distribution contexts when the reward model is erroneous.
• The deviation of the estimated reward from the oracular reward on unseen contexts exacerbates misalignment,
which can be mitigated by increasing the weight on the KL penalty due to the 1/β dependence in the exponent.
• However, there is a trade-off: increasing the value of β brings both π∗rlhf and πrlhf closer to πpre and lowers their
performance, due to the increased weight on the KL penalty.
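To ground the role of β, the snippet below shows one common way the KL penalty is folded into the per-sequence reward during RL fine-tuning; the exact placement of the penalty varies across implementations, so treat this as an illustrative sketch rather than the canonical recipe.

```python
def kl_penalized_reward(reward, logp_rlhf, logp_pre, beta):
    """Combine the reward-model score with a KL penalty toward the pre-trained model.

    reward:    R_phi(c, o) from the reward model
    logp_rlhf: summed log-probability of the output under the policy being trained
    logp_pre:  the same log-probability under the frozen pre-trained model pi_pre
    beta:      KL-penalty weight; larger beta keeps pi_rlhf closer to pi_pre
    """
    kl_estimate = logp_rlhf - logp_pre   # per-sequence estimate of log(pi_theta / pi_pre)
    return reward - beta * kl_estimate
```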
The efficacy of RLHF heavily relies on the quality of the reward model, and thus a large fraction of future research
must focus on improving the reward model. Before allocating resources to that effort, it is essential to evaluate the
merits and downsides of employing reinforcement learning as the fine-tuning paradigm. In comparison to supervised
learning as an alternative approach, examining the gradient updates of a (vanilla) policy gradient algorithm alongside
those of a supervised learning algorithm (such as supervised fine-tuning) offers some insights.
Comparing Update Rules of Supervised Fine-Tuning and RLHF: In supervised fine-tuning (SFT), supervision is
provided with positive samples, and the language model is updated to increase the likelihood of those samples under the
model. Notably, there is no supervision provided for neutral or undesirable outputs, although it is a feasible option.
Given the optimal policy π ∗ (which may be a human expert), the objective of SFT is,
$$\max_\theta\, \mathbb{E}_{c\sim d_C,\, o_w\sim\pi^*(\cdot\mid c)}\!\left[\ln \pi_\theta(o_w \mid c)\right]$$
and thus the gradients used to update the parameters of the language model are of the form:
$$\nabla_\theta := \mathbb{E}_{c\sim d_C,\, o_w\sim\pi^*(\cdot\mid c)}\!\left[\nabla_\theta \ln \pi_\theta(o_w \mid c)\right].$$
This is analogous to behavior cloning in RL [Pomerleau 1988] which is known to struggle when faced with out-of-
distribution inputs.
The primary benefit that reinforcement learning algorithms provide is that they allow the language model to explore the
output space. Through its decoding algorithm, the language model exercises control over the distribution of outputs on
which feedback is acquired. This facilitates learning from both positive as well as negative feedback, i.e.,
$$\max_\theta\, \mathbb{E}_{c\sim d_C,\, o\sim\pi_\theta(\cdot\mid c)}\!\left[R^\star(c, o)\right]$$
and the (vanilla) policy gradient update is:
$$\nabla_\theta := \mathbb{E}_{c\sim d_C,\, o\sim\pi_\theta(\cdot\mid c)}\!\left[R^\star(c, o)\,\nabla_\theta \ln \pi_\theta(o \mid c)\right].$$
As the update rules highlight, in SFT the gradient is estimated only from the positive samples, while in RL it is computed for
all samples (positive, negative, or neutral), weighted by their corresponding rewards. The gradient updates in RL are
more informative, leading to better generalization for the language model and improved sample efficiency. Beyond
exploration and richer gradients, the field of inverse reinforcement learning provides a natural formulation for training a
language model with human feedback [Arora and Doshi 2021].
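The contrast between the two update rules can be made explicit in code. In the hedged sketch below, the inputs are assumed to be torch tensors of summed per-output log-probabilities: the SFT loss weights every human-written sample equally, while the policy-gradient (REINFORCE-style) loss weights model-generated samples by their reward, so negative or neutral samples also shape the gradient.

```python
def sft_loss(logp_demonstrations):
    """Supervised fine-tuning: maximize log-likelihood of positive (expert) samples only."""
    return -logp_demonstrations.mean()

def policy_gradient_loss(logp_samples, rewards):
    """Vanilla policy gradient: every sampled output contributes,
    weighted by its reward (positive, neutral, or negative)."""
    # Rewards act as fixed weights; gradients flow only through logp_samples.
    return -(rewards.detach() * logp_samples).mean()
```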
In the following sections, we present a review of works that lead up to and are being rapidly added to this active area of
research. This review provides context for the first half of this work and also serves as a comprehensive introduction for
readers interested in getting started and understanding the topic of RLHF for language models.
7 Review of Reinforcement Learning from Human Feedback for Language Models
7.1 Language Model Pre-Training: Foundation for Large Language Models
Language Models (LMs) have gained significant attention in recent years due to their impressive abilities to model
language and retain textual knowledge. The Transformer architecture, characterized by its use of self-attention
mechanisms, has become the standard for LMs [Vaswani et al. 2017]. It is employed in a range of models, including
BERT, T5, LLaMA, GPT-3, PALM, GLaM [Devlin et al. 2019; Raffel et al. 2019; Touvron et al. 2023a; Brown et al.
2020; Chowdhery et al. 2022; Du et al. 2021].
Pre-training has played an important role in the development of Large Language Models (LLMs), significantly
contributing to their remarkable performance across a myriad of downstream tasks [Brown et al. 2020; Chowdhery
et al. 2022; Zhang et al. 2022]. This process involves training models with an unsupervised training objective on
extensive datasets, often comprised of a diverse mix of web content, literary works, scientific documents, and code
repositories [Rae et al. 2021; Xie et al. 2023]. The scale of these datasets is critical, with studies highlighting the
superior performance of smaller models trained on larger datasets [Kaplan et al. 2020; Hoffmann et al. 2022; Touvron
et al. 2023a]. In addition to scale, the quality of training data, ensured through deduplication and filtering of low-quality
content, is a key determinant of model performance [Rae et al. 2021; Du et al. 2021; Hernandez et al. 2022; Lee
et al. 2021]. Masked Language Modeling (MLM) [Devlin et al. 2019] and Causal Language Modeling [Radford and
Narasimhan 2018] are the most common objectives used for pretraining, with the latter showing notable success in recent
large language model series such as GPT, PaLM, and OPT [Anil et al. 2023; OpenAI 2023; Zhang et al. 2022].
Studies demonstrate that pre-training by itself is responsible for the bulk of the observed capabilities, even on downstream
tasks [Brown et al. 2020; Raffel et al. 2019]. The simple pre-training objective of next-token (or masked-token) prediction
imbues LMs with a range of capabilities. They are few-shot learners, without the need for fine-tuning. This applies
to a variety of tasks, including text generation, reasoning, question answering, summarization, and translation, to name a few.
However, though scaled pretrained language models (PLMs) exhibit remarkable performance across a variety of tasks,
they suffer from several limitations, such as the inability to follow human instructions [Ouyang et al. 2022]. This is
because PLMs suffer from objective mismatch problems (See Section 2), as they are trained on generic internet data. As
a result, PLMs need to learn to mimic the conflicting behavior of billions of humans. Further, maximum likelihood
estimation on next-token prediction for such data does not explicitly penalize the model for hallucinating concepts,
i.e., generating concepts not encapsulated within its internal representation, and important and unimportant errors
are given equal weight. Moreover, pretrained models often show unintended behavior such as generating harmful,
biased, untruthful, and low-quality content [Perez et al. 2022].
Supervised-Finetuning To address the shortcomings faced by PLMs, a straightforward approach is to fine-tune them
on a set of high-quality downstream datasets that are indicative of the intended task and behavior. For example, for
instruction-following, human annotations can be collected on a set of input prompts, or input instances of existing
public datasets can be re-formatted for instruction-following format. The model is then simply fine-tuned on these
human demonstrations, often with the same pretraining objective. This increases the likelihood of generating desirable
text and makes the model less biased and harmful. Nonetheless, it is crucial to note that distinguishing between high-
and low-quality text is inherently subjective and challenging, since the end users are humans. Thus, quality assessment
rests on human judgment and varies significantly based on the individual evaluator’s perspective [Yi et al. 2019; Fan
et al. 2022; Ziegler et al. 2019]. Incorporating human feedback into such a process can be challenging, and collecting
high-quality human demonstrations is expensive and does not scale.
7.2 Reinforcement Learning from Human Feedback (RLHF): Overview and Motivation
The Importance of Human Feedback in Language Models The alignment of a model with the user’s intentions and
preferences is critical, and incorporating human feedback in model training is a key step towards achieving this (Section
2). However, the process of obtaining high-quality human feedback, particularly in the form of human demonstrations,
can be a resource-intensive process, both in terms of time and cost. A more efficient approach is to collect feedback on
the outputs generated by the model and train the language model to incorporate this feedback. However, collecting such
a large amount of feedback is also costly and impractical for real-time/online collection during training.
The Role of RLHF in Language Models Reinforcement Learning from Human Feedback (RLHF) offers a solution to
these challenges. In RLHF, human feedback is collected offline and used to train a reward model. This reward model then
acts as a surrogate for human feedback during training, providing reward signals to the Language Model. Reinforcement
learning algorithms form the natural candidates for training a model from scalar evaluative feedback, as provided by the
reward model. This forms the essence of Reinforcement Learning from Human Feedback (RLHF) [Christiano et al. 2017] as used to train language models.
Reinforcement Learning from Human Feedback in Language Models:
- Pretrained Language Models (§7.1): GPT-3 [Brown et al. 2020], PaLM [Chowdhery et al. 2022], OPT [Zhang et al. 2022], LLaMA [Touvron et al. 2023a]
- Human Feedback (§7.3):
  - Preference Based: [Kreutzer et al. 2018b], InstructGPT [Ouyang et al. 2022], [Bai et al. 2022a], Sparrow [Glaese et al. 2022]
  - Rating Based: [Kreutzer et al. 2018a], [Liu et al. 2018], [Fan et al. 2022]
  - Language Based: [Li et al. 2016], [Scheurer et al. 2023], [Nguyen et al. 2021]
  - Miscellaneous Feedback: Sparrow [Glaese et al. 2022], [Uesato et al. 2022], [Wu et al. 2023c]
- Supervised Fine-Tuning (§7.4): [Wei et al. 2021], [Zhou et al. 2023], [Chiang et al. 2023]
- RL-Training (§7.6), by algorithm:
  - PPO [Schulman et al. 2017]: InstructGPT [Ouyang et al. 2022], [Bai et al. 2022a], [Touvron et al. 2023b]
  - Actor-critic [Bahdanau et al. 2016; Nguyen et al. 2017]: Sparrow [Glaese et al. 2022], GopherCite [Menick et al. 2022], [Perez et al. 2022]
  - Others: [Ramamurthy et al. 2022], [Scheurer et al. 2023], [Munos et al. 2023]
- Reward Models, by task:
  - Translation: [Nguyen et al. 2017], [Kreutzer et al. 2018a], [Kiegeland and Kreutzer 2021]
  - Summarization: [Stiennon et al. 2020], [Nguyen et al. 2022], [Ziegler et al. 2019]
  - Dialogue: InstructGPT [Ouyang et al. 2022], [Bai et al. 2022a], Nano [Fan et al. 2022]
  - Citing Answers: [Menick et al. 2022], [Nakano et al. 2021]
- Non-RL Training: [Dong et al. 2023], [Yuan et al. 2023], [Scheurer et al. 2022]
- Reward Alternatives (§7.9): [Zhao et al. 2023], [Liu et al. 2023a], [Bai et al. 2022b]
Figure 4: Categorization of the different components of RLHF and representative works from the literature.
This approach is more sample-efficient and has shown more promising results compared to supervised fine-tuning alone [Ouyang et al. 2022].
Applications of RLHF in Language Models In early works, RL was used to train language models
across various domains such as dialogue generation [Yi et al. 2019; Li et al. 2016; Jaques et al. 2019], machine
translation [Kreutzer et al. 2018a; Nguyen et al. 2017; Fernandes et al. 2022; Sokolov et al. 2016], text generation [Li
et al. 2017; Shi et al. 2018; Zhou and Xu 2020; Ziebart et al. 2008], semantic parsing [Lawrence and Riezler 2018],
and summarization [Stiennon et al. 2020; Ziegler et al. 2019; Wu et al. 2021]. More commonly, these methods were trained
using non-differentiable automated evaluation metrics such as BLEU, ROUGE [Ranzato et al. 2015; Shen et al. 2015;
Keneshloo et al. 2018], or simulated feedback [Nguyen et al. 2017]. However, while the combination of RL and human
feedback has been extensively studied [Knox and Stone 2008; Christiano et al. 2017], it is only recently that RLHF with
LLMs has achieved significant success in sequence-to-sequence tasks such as Summarization [Stiennon et al. 2020;
Ziegler et al. 2019; Wu et al. 2021], providing reliable answers with citations to queries [Nakano et al. 2021; Glaese
et al. 2022], and creating Helpful, Harmless, and Honest dialogue agents aligned with broad human values [Ouyang et al.
2022; Bai et al. 2022a].
Formulating Language Modeling as an RL Problem Reinforcement Learning (RL) is a learning paradigm for
a setting where an agent must make a sequence of decisions while interacting with an environment and obtaining
evaluative feedback in the form of rewards. The agent’s objective is to maximize the total reward it receives over time.
In the context of language models, the agent is the language model itself, and its actions consist of generating tokens
from its vocabulary. The agent’s policy, which maps states to actions, is represented by the language model’s parameters.
The agent receives rewards from the environment, which in this case is a reward function that forms a surrogate from
human feedback (Section 4). The agent’s objective is to optimize its actions (by updating its policy) to maximize
the cumulative reward. A thorough mathematical formulation can be found in Section 3 and is summarized in
Table 2 and Figure 2. While these details are sufficient for further discussion in the paper, we refer interested readers to
Arulkumaran et al. [2017]; Sutton and Barto [2018] for more details about reinforcement learning.
The Workflow of RLHF RLHF, as first popularized by Christiano et al. [2017] for mastering Atari games, consists
of three crucial stages. An overview of the standard RLHF workflow is highlighted in Figure 5.
Figure 5: Workflow of RLHF. All RLHF workflows for training language models begin with a pretraining phase,
optionally followed by supervised fine-tuning (SFT) on human demonstrations. This is followed by an iterative loop
that starts with collecting human feedback on model-generated outputs, then training a reward model, and finally
updating the language model using a suitable RL algorithm.
Table 2: Mapping of terms used in RL literature to training of language models through RLHF. See Table 1 for
mathematical formulation.
The first stage involves the collection of human feedback on a set of <input, output> pairs. These pairs can be sourced from existing datasets
or generated by the pre-trained model for a given set of input prompts. The second stage involves learning a reward
model from the collected human feedback. The reward model is trained to output a scalar reward for a given <input,
output> pair, indicating the favorability of the pair. In essence, the reward model is trained to mimic human feedback,
such that for a given input, desirable outputs are scored higher than undesirable outputs. The final stage involves the
RLHF training of the language model, where the reward model provides reward signals on model outputs, usually in the
form of scalar reward. The parameters of the language model are then updated based on these reward signals using an
appropriate policy-gradient RL algorithm, updating the model to produce more rewarding outputs.
These stages can be performed iteratively, with the intermediately trained model generating more prompts to collect
additional human feedback. This feedback is then used to train the reward model, and the process is repeated multiple
times [Stiennon et al. 2020; Bai et al. 2022a; Menick et al. 2022]. In the following sections, we discuss each of
these stages in detail. We start with Human Feedback Collection (Section 7.3), followed by training the Initial Policy
(Section 7.4), Reward Model Training (Section 7.5), and finally RLHF Training (Section 7.6). Finally, we discuss the
properties of RLHF-trained models and their limitations in Section 7.7.
In this section, we discuss the nature, objectives, and different types of human feedback, followed by the challenges and
strategies associated with collecting high-quality feedback.
Tasks such as summarization and providing helpful answers are inherently ambiguous and require human judgment to
evaluate the quality of the generated text. Automated metrics like BLEU and ROUGE [Lin and Och 2004] often do
not correlate with human judgment [Liu et al. 2016; Schluter 2017; Sellam et al. 2020; Stiennon et al. 2020], making
them unreliable for evaluation and training. Thus, acquiring high-quality human feedback to align the model with
human behavior becomes crucial. Feedback is typically provided on the outputs generated by the model (or input-output
pairs from the dataset), and subsequently, the model is trained to learn from this feedback. However, capturing diverse
human preferences is a challenging task. One approach to encapsulate subjective human preferences is to approximate
them using “models of human behavior”. This concept of human behavior models has roots in diverse fields such
as econometrics [McFadden 1981], psychology [O’Connor 1989], and inverse reinforcement learning. A notable
example is the Bradley-Terry model [Bradley and Terry 1952], a probabilistic model that encodes the preference of one
output over another in pairwise competitions. In the context of RLHF, reward models that form surrogates for human
preferences serve as such models of human behavior.
The type of feedback collected depends on the intended objective to be exhibited by the fine-tuned language model. Askell et al. [2021] propose three objectives for an aligned language model: Helpfulness, Honesty, and Harmlessness (HHH). These objectives can be broadly defined as follows:
- Helpful: A Language Model is considered helpful if it can efficiently complete tasks or answer questions (while
being harmless), ask relevant follow-up questions when necessary, and appropriately redirect ill-informed requests.
Helpfulness includes context-dependent aspects such as informativeness, coherence, relevance, creativity, and specificity.
- Honest: Honesty in a Language Model implies providing accurate information, expressing appropriate levels of
uncertainty, and honestly conveying its capabilities, knowledge, and internal state. Language Models are particularly
susceptible to hallucination [Khandelwal et al. 2019; Maynez et al. 2020], making it essential to penalize such behavior.
Unlike helpfulness, honesty is more objectively evaluated.
- Harmless: A harmless Language Model should avoid offensive or biased behavior, refuse to aid in dangerous acts,
recognize disguised nefarious attempts, and act with modesty and care when providing advice with potentially sensitive
or consequential impacts.
These broad objectives, as mentioned above, encompass specific objectives, which can be considered subcategories.
For example, in the case of summarization, the summary should be helpful to the reader and should not contain any
false or harmful information. Similarly, the goal of reducing bias in a dialogue agent’s responses can be considered a
subset of the Harmless objective. At the same time, coherence and creativity in the generated text are aspects of being
helpful. These objectives are not mutually exclusive and are context and task-dependent. Even human labelers and
researchers have shown disagreements in annotation [Kreutzer et al. 2018b].
Human Feedback is usually collected on model-generated outputs. Good feedback should incorporate information on
where the model output is lacking and how to improve it. A simple process is to let human labelers provide feedback on
a set of model outputs generated from a dataset of prompts or inputs. Alternatively, existing datasets can be repurposed
to incorporate implicit feedback, such as rating different user choices [Kreutzer et al. 2018a]. Regardless of the process,
human feedback can be collected in various forms, such as binary responses, preference ranking, language feedback,
etc. While the choice of feedback type depends on the downstream task, feedback should be collected in a way that is easy for humans (labelers) to provide, yields high agreement among labelers, and is informative. In this section, we classify feedback into four categories: rating feedback, ranking feedback, language feedback, and miscellaneous feedback.
Rating Feedback The simplest form of rating feedback is binary feedback, where the labeler is asked to provide a
binary response (yes/no) to a given input [Li et al. 2016; Scheurer et al. 2023]. Binary feedback is easy to collect and
interpret. Some works have used binary responses to get feedback on multiple questions (such as if the generated text is
coherent) [Yi et al. 2019]. A richer form of feedback is to ask labelers to provide a rating on a scale. The scale can be continuous [Graham et al. 2013] or discrete, similar to a Likert scale [Likert 1932] (where users rate using an integer from 1 to k) [Kreutzer et al. 2018a; Jaques et al. 2019]. A different variant of rating feedback is to provide categorical feedback
such as ‘incorrect’, ‘partially-correct’, and ‘correct’ [Gao et al. 2023]. While rating feedback is easy to specify, often
inter-annotator agreement is low because of the subjective nature of the task [Kreutzer et al. 2018b]. Further, the order
of examples presented to the annotator may bias the results [Yannakakis and Hallam 2011]. Moreover, it is challenging
to differentiate between data points with outputs of similar quality since feedback is provided individually to each
output without comparison.
Ranking or Preference Feedback Ranking or preference-based feedback has been extensively used in the recent development of AI assistants and has been found to be both convenient to collect and effective. Specifically, the labeler is offered binary [Stiennon et al. 2020] or multiple-choice [Ziegler et al. 2019] options and asked to select the most appropriate response based on a certain set of instructions (directions). Recently, Zhu et al. [2023a] have shown convergence guarantees for reward models trained using this form of feedback. Moreover, given an input prompt, it is common to ask labelers to rank k (> 2) generated responses, which are then repurposed as pairwise comparisons for the reward model [Ouyang et al. 2022]. However, collecting pairwise feedback can still be difficult for near-identical responses and may require labelers to spend substantial time even on a single input [Scheurer et al. 2023]. Additionally, preference-based feedback provides a very sparse signal, conveying limited information about the reasoning behind the provided feedback. It is also provided only on the complete text generated by the model (the trajectory) and not on specific parts of the text (a particular state) [Pang et al. 2022; Lewis et al. 2017]. Finally, preference-based feedback provides no improvement in inter-annotator agreement compared to rating feedback [Kreutzer et al. 2018b].
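As an illustration of how ranked responses are repurposed, the following sketch expands a ranking over k responses into the pairwise comparisons used for reward-model training; the data layout is an assumption made for clarity.

```python
from itertools import combinations
from typing import List, Tuple

def ranking_to_pairs(responses: List[str], ranking: List[int]) -> List[Tuple[str, str]]:
    """Expand a ranking over k responses into (preferred, rejected) pairs.

    `ranking` lists response indices from best to worst, e.g. [2, 0, 1] means
    responses[2] > responses[0] > responses[1]. A ranking of k items yields
    k*(k-1)/2 pairwise comparisons.
    """
    ordered = [responses[i] for i in ranking]
    return [(better, worse) for better, worse in combinations(ordered, 2)]

# Example: 3 responses ranked best-to-worst as indices [2, 0, 1] -> 3 pairs.
pairs = ranking_to_pairs(["a", "b", "c"], [2, 0, 1])
assert pairs == [("c", "a"), ("c", "b"), ("a", "b")]
```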
Language Feedback A more informative way to provide feedback is in free-form language. This provides a dense reward signal, specifying more precisely where the model goes wrong or needs improvement. For example, consider the case where the output generated by the model is "A humorous story about a specific profession involving person A and person B." The previous feedback forms would provide only sparse signals, such as indicating that the output is inappropriate. However, such feedback alone will not help the model identify the cause of inappropriateness; from a single example, the model could infer that the text is inappropriate because "it is wrong to create humor in general," "it is wrong to create humor about specific professions," or "it is wrong to involve individuals in humorous stories," and so on. On the other hand, free-form feedback can be more precise, such as "It is inappropriate to create humor that targets specific professions." This enables the model to understand the issue from a single example and generalize to similar cases without learning from more examples.
Language feedback has been extensively used in various domains such as dialogue models [Li et al. 2016; Hancock et al. 2019], summarization [Scheurer et al. 2023], question answering [Li et al. 2022], and code generation [Chen 2023]. Recently, Scheurer et al. [2023] have shown that language feedback is more effective than preference-based feedback in the context of summarization systems. Also, as Hancock et al. [2019] discuss, collecting preference-based feedback is feasible for paid labelers but not for real users of deployed systems. Real users interact with the system through free-form language; hence, collecting human feedback in free-form language is more natural. Although it is task-dependent, Scheurer et al. [2023] further find that labelers take only about 3x as long to provide language feedback as preference-based feedback, despite the former conveying much more granular information. However, incorporating language feedback into the RLHF pipeline is not straightforward, and there has been limited work in this direction.
Miscellaneous Feedback Apart from providing a single type of feedback, methods have experimented with using a combination of feedback types or altogether different types. For example, [Glaese et al. 2022] uses a combination of rule-violation feedback (binary), preference-based feedback, and ratings of evidence. [Uesato et al. 2022; Korbak et al. 2023] provide segment-level feedback instead of feedback on the whole text, and [Wu et al. 2023c] provide feedback at the token level. Moreover, some studies employ indirect methods for collecting feedback. For example, [Kreutzer et al. 2018a] uses human interactions on translated eBay titles to find more preferred translations.
Further, it is also possible to provide computational feedback, for example, from automated metrics [Bahdanau et al. 2016], forms of synthetic feedback [Kim et al. 2023; Black et al. 2023], web descriptions [Hanjie et al. 2022; Aggarwal et al. 2023], or LLM-generated feedback [Shinn et al. 2023; Madaan et al. 2023; Yang et al. 2022a], which might, in turn, be generated based on certain human requisites or instructions [Bai et al. 2022b; Sun et al. 2023; Kundu et al. 2023]. However, these methods use little to no human feedback, may have several unexplored limitations such as instability and lack of robustness [Shumailov et al. 2023; Alemohammad et al. 2023; Gudibande et al. 2023], and are not the focus of this survey. We refer readers to Fernandes et al. [2023] for a discussion of the different types of feedback used in natural language generation.
Collecting high-quality human feedback is a challenging task that has been the focus of extensive research. The quality
of feedback is pivotal; subpar or noisy feedback can significantly hamper the performance of the final trained model. For
example, for summarization tasks, [Ziegler et al. 2019] discovered that their model predominantly extracted verbatim
lines from the document. This was later attributed to low-quality feedback by [Stiennon et al. 2020]. Similarly, the amount of feedback is also crucial. For example, despite employing similar methodologies, [Lightman et al. 2023] identified the need for a 'greater amount of feedback' for the methods in [Uesato et al. 2022] to be effective, as the intended objective was not even observed in the latter work.
The provision of clear and unambiguous instructions to the labelers is a fundamental requirement [Ziegler et al. 2019;
Nakano et al. 2021]. Failure to do so can not only result in low-quality feedback but also introduce systematic bias
in the collected feedback and, consequently, the model [Parmar et al. 2022]. Typically, labelers are provided with a
comprehensive set of instructions, including guidelines for handling edge cases [Bai et al. 2022a]. [Glaese et al. 2022] even provide a tutorial to a select group of labelers.
Researchers typically screen labelers to ensure they possess the necessary skills to provide feedback. For instance, in
the case of translation tasks, bilingual labelers with native proficiency in both languages are preferred [Kreutzer et al.
2018b]. Additionally, a minimum educational qualification is generally preferred. For example, [Stiennon et al. 2020] requires labelers to have at least a high-school degree, whereas [Nakano et al. 2021], [Glaese et al. 2022], and [Bai et al. 2022a] require at least an undergraduate or master's degree. The end goal also influences the selection
of labelers. For instance, creating a harmless and helpful chatbot necessitates a diverse group of labelers with varying
backgrounds and demographics [Ouyang et al. 2022; Bai et al. 2022a] as otherwise this may result in implicit biases in
the model [Peng et al. 2022]. For instance, currently deployed language models have been shown to reflect views more
aligned with western audiences [Durmus et al. 2023] and may have systematic political biases [Santurkar et al. 2023],
partly owing to the lack of annotators from diverse demographic groups.
However, despite screening, there may be low agreement among the annotators themselves, or even between researchers and annotators [Kreutzer et al. 2018a]. Labelers are therefore further screened based on two standard criteria: (1) inter-annotator agreement, i.e., the agreement between different annotators on the same example, and (2) expert-annotator agreement, i.e., the agreement between annotators and experts [Kreutzer et al. 2018b]. The former metric ensures that labelers are consistent in their feedback, while the latter is used to retain only those labelers who have high agreement with experts. [Menick et al. 2022] creates a group of super-raters who have high agreement with experts, and the group is expanded iteratively. Even after filtering, some methods maintain a hands-on relationship with labelers [Stiennon et al. 2020] and have also created Slack groups for discussing any bugs, issues, or edge cases [Bai et al. 2022a].
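As a concrete illustration of the first criterion, inter-annotator agreement between two labelers can be measured with a chance-corrected statistic such as Cohen's kappa; the minimal sketch below is illustrative and not tied to any particular annotation pipeline.

```python
from collections import Counter
from typing import Hashable, List

def cohens_kappa(labels_a: List[Hashable], labels_b: List[Hashable]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently at their own base rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Example: two annotators rating 5 outputs as acceptable (1) or not (0).
print(cohens_kappa([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # ~0.62
```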
Upon the collection of high-quality feedback, the subsequent step is to assimilate this feedback to train the model. The
most direct method to achieve this is to perform supervised fine-tuning of the language model based on the collected
feedback. Specifically, human feedback is gathered in the form of expert outputs on input prompts, also referred to as
human demonstrations. These human demonstrations can be perceived as positive example outputs to prompts that
should be generated by the language model. The model is then fine-tuned on these demonstrations using the same
pretraining objective, and this process in RL terminology is often termed behavior cloning [Nakano et al. 2021].
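A minimal sketch of this behavior-cloning objective, assuming a PyTorch-style setup where prompt tokens are masked out so that the loss is computed only on the human-written demonstration, is shown below.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Behavior-cloning / SFT objective: next-token cross-entropy on demonstrations.

    logits:     (batch, seq_len, vocab) model outputs
    target_ids: (batch, seq_len) tokens of prompt + human demonstration
    loss_mask:  (batch, seq_len) 1 for demonstration tokens, 0 for prompt/padding,
                so the model is trained to imitate only the human-written completion.
    """
    # Shift so position t predicts token t+1, as in standard language modeling.
    shifted_logits = logits[:, :-1, :]
    shifted_targets = target_ids[:, 1:]
    shifted_mask = loss_mask[:, 1:].float()
    token_loss = F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
        reduction="none",
    ).reshape(shifted_targets.shape)
    return (token_loss * shifted_mask).sum() / shifted_mask.sum().clamp(min=1.0)
```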
Additionally, when dealing with preference data, the model can be directly fine-tuned on preferred feedback. However,
this approach exhibits limitations by not accounting for negative feedback—outputs that the model should avoid
generating. This is crucial for training robust models that can handle adversarial situations and identify and rectify
errors. To tackle this limitation, alternative methods that incorporate both positive and negative feedback have been
developed, as discussed in Section 7.9.
In addition to human demonstrations, existing public instances from NLP datasets can be used as instruction tuning
demonstrations [Wei et al. 2021]. This usually involves creating new instruction-tuning datasets by adding task
instructions to existing examples from the dataset [Ajith et al. 2023]. In another line of work, prompts from the initial iterations of GPT-3 [Brown et al. 2020], served to real customers through a web API, were used to fine-tune the model on expert (human) demonstrations provided by contracted labelers [Ouyang et al. 2022].
Limitations of Supervised Finetuning While finetuning on supervised data enhances the model beyond its pretrained
version in following instructions and intended tasks, it suffers from numerous limitations. For instance, it does not
penalize the model for hallucinating or permit it to learn from neutral or negative feedback. This can lead to harmful and unintended behaviors, which are easier to elicit from such models through prompting [Ganguli et al. 2022; Perez et al. 2022].
Furthermore, behavior cloning is likely to perform poorly in out-of-distribution prompts [Pomerleau 1988]. These
limitations may stem from the fact that during behavior cloning, the model is not allowed to explore the vast space of
possible actions, i.e., the model is not allowed to generate outputs that are not present in the demonstrations and, in turn,
get feedback for them. We refer readers to Section 6.3 for theoretical discussion on the limitations of SFT.
SFT as Initial Policy in RLHF models Despite its caveats, supervised fine-tuning plays a pivotal role in RLHF, as it provides a robust initial policy that allows RLHF methods to work well. From an RL perspective, learning algorithms such as Proximal Policy Optimization (PPO), widely used for training sequence-to-sequence models, struggle to improve from poor initializations, especially when the action space is large, as in the case of text generation. This is because these methods use model-based exploration, which is ineffective when the transition probabilities over many actions are similar [Nguyen et al. 2017], i.e., when different text outputs have similar probabilities of being generated.
Furthermore, as we discuss in Section 7.6, a KL penalty is usually applied to ensure that the text generated by the RL-tuned model stays close to that of the initial model. Thus, during RL training, it is preferable to start with an initial model that already generates decent-quality text.
Empirical studies have demonstrated that starting with fine-tuning on high-quality human demonstrations results in
significant improvements over starting with pretrained language models [Stiennon et al. 2020; Ouyang et al. 2022]. For
instance, InstructGPT collects API customer and labeler written prompts and outputs to fine-tune their model before
initiating with the RLHF training [Ouyang et al. 2022].
[Glaese et al. 2022] have also shown that starting with a prompted (dialogue) model instead of fine-tuning on labeled demonstrations is possible. However, they start with a large model (70B parameters) and do not perform a comparative study against fine-tuning on human demonstrations. Thus, it cannot be definitively concluded that starting with a prompted model is equivalent to fine-tuning on human demonstrations. Moreover, prompting uses up a major portion of the model's context length, which, apart from the added computational burden, can be limiting for tasks that require long contexts. [Askell et al. 2021] propose context distillation, training the model to generate output similar to its prompted counterpart using a KL-divergence loss. They find performance similar to the prompted model, and the method has been used in their subsequent works [Bai et al. 2022a].
In conclusion, while supervised fine-tuning can be used independently of the RLHF pipeline, it suffers from several significant limitations. Nevertheless, it serves as an integral step in RLHF, providing a robust initial policy that is crucial for subsequent RL training.
Reward as a Proxy for Human Feedback After the collection of human feedback, the next challenge is training
the language model effectively. Although supervised fine-tuning offers a straightforward method, its effectiveness is
limited by the volume of human feedback. In contrast, RLHF introduces a reward model to emulate human feedback,
thereby acting as a stand-in for the true reward function, i.e., the actual human feedback. This reward model, usually
much smaller than the language model, facilitates fine-tuning the language model using feedback generated by it on
new model outputs, avoiding the need for additional costly human annotation. In practice, using a reward model over
supervised fine-tuning has been found more data-efficient [Ramamurthy et al. 2022].
Training a Reward Model The reward model is a fine-tuned language model that assigns a scalar reward score to an input-output pair. The final (unembedding) layer is replaced with a single projection layer that outputs this scalar reward. While the reward model can learn from various types of feedback, recent studies highlight the simplicity and effectiveness of preference-based feedback [Ouyang et al. 2022; Bai et al. 2022a]. This approach involves fine-tuning the initialized reward model to predict the preference between two trajectories (output texts) given the same input prompt or context. The preference is typically modeled with a Bradley-Terry-Luce (BTL) model [Bradley and Terry 1952], where the probability of preferring one trajectory over another is a function of the difference in their reward scores. Mathematically, this can be represented as:
Pr(o ≻ o′ | c, ϕ) = σ[Rϕ(c, o) − Rϕ(c, o′)]    (18)
where σ(x) = 1/(1 + exp(−x)) is the sigmoid function, o and o′ represent the two trajectories, and their rewards are Rϕ(c, o) and Rϕ(c, o′), respectively. This form of reward modeling has been found to provide smoother rewards and
is less noisy [Christiano et al. 2017]. A similar method can then be used for ranking between k trajectories (k > 2),
where the preference is modeled with a Plackett-Luce (PL) model [Plackett 1975; Luce 1979]. Moreover, [Zhu et al. 2023a] provide theoretical convergence guarantees for the maximum-likelihood estimates of both the BTL and PL models.
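For illustration, the negative log-likelihood of Eq. (18) over a batch of (preferred, rejected) pairs can be written as follows; the tensor layout is an assumption, not a prescription from the cited works.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss implied by Eq. (18).

    r_preferred / r_rejected: (batch,) scalar reward-model scores R_phi(c, o) for the
    preferred and rejected outputs of the same context. Minimizing this loss maximizes
    sigma(R_phi(c, o_preferred) - R_phi(c, o_rejected)), the modeled probability of the
    human's observed choice.
    """
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Example: reward-model scores for a batch of two comparisons.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
```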
The size and initialization of the reward model are critical determinants of its performance. While smaller reward models
are easier to train, scaling laws suggest that larger models yield better agreement with actual human preferences [Askell
et al. 2021]. However, [Ouyang et al. 2022] found that training very large reward models can be unstable and result in
overfitting. Instead, they report good performance even when using a reward model that is 30 times smaller than the
policy model.
Regarding initialization, multiple methods have been proposed. While [Ziegler et al. 2019] fine-tunes a pretrained
language model on preference data collected on model-generated outputs, [Ouyang et al. 2022] trains a GPT-3 based
reward model on publicly available datasets. However, only a slight advantage was found over using pretrained language
models or supervised-fine-tuned models. Leveraging publicly available preference datasets (such as ranked answers
from StackOverflow), as suggested by [Askell et al. 2021], notably enhances reward model performance, especially for
smaller models and datasets.
Challenges in Reward Modeling The reward model is initially trained on a selected set of input prompts and
corresponding initial model outputs. As the model training progresses, it is crucial for the reward model to generalize
to new model outputs and potentially new input prompts. We refer readers to Section 5.1 for a deeper theoretical
exploration of this aspect.
Regarding the generalization capabilities of reward models, Ouyang et al. [2022] presents findings that demonstrate
high generalization to held-out test labelers. This capability is of paramount importance since a majority of the inputs
encountered during language model training would be out-of-distribution w.r.t. the reward model training phase.
Generalization capability depends on various factors such as the dataset’s size, the amount of noise in the feedback
dataset, and the characteristics of the pretrained reward model.
Moreover, the robustness and calibration of the reward models with respect to actual human preferences are essential
for their effectiveness. A well-calibrated reward model should accurately predict the probability of a human preferring
one output over another. Bai et al. [2022a] discovered that when training solely on a helpfulness feedback dataset, their
model exhibits strong calibration. However, when trained on a mixture of helpfulness and harmlessness datasets, the
model is underconfident in its predictions.
To assess robustness, a common practice involves evaluating the policy model trained using the reward model. Interestingly, Bai et al. [2022a] discerned that smaller reward models and higher rewards correlate with decreased robustness.
This phenomenon arises from the reward model’s initial training on model outputs with naturally low rewards. To
address this distribution shift, an approach involving iterated training of the reward model is proposed (see Section 7.6).
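One way to make the calibration notion concrete is to bin the reward model's predicted preference probabilities σ(ΔR) on held-out comparisons and compare each bin's mean prediction with the fraction of comparisons the humans actually preferred; the binning scheme below is illustrative.

```python
import torch

def reward_model_calibration(
    reward_gaps: torch.Tensor,       # (n,) R_phi(c, o) - R_phi(c, o') on held-out comparisons
    human_prefers_o: torch.Tensor,   # (n,) 1.0 if the human preferred o over o'
    num_bins: int = 10,
):
    """Compare predicted preference probabilities sigma(reward gap) with empirical
    preference rates, bin by bin; a well-calibrated reward model matches the two."""
    probs = torch.sigmoid(reward_gaps)
    bins = torch.clamp((probs * num_bins).long(), max=num_bins - 1)
    report = []
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            report.append((probs[mask].mean().item(), human_prefers_o[mask].float().mean().item()))
    return report  # list of (mean predicted prob, empirical preference rate) per bin
```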
In summation, the discussion underscores that a reward model trained on preferences is an imperfect proxy for human feedback, especially in out-of-domain cases.
Moving Beyond Scalar Rewards Apart from providing a single scalar reward at the end of a trajectory (the complete text output), several methods model rewards in a more fine-grained manner. [Uesato et al. 2022; Korbak et al. 2023] provide a segment-level reward during training, a method also known as process supervision. Interestingly, while [Uesato et al. 2022] did not find any major downstream performance improvement with their method, [Lightman et al. 2023], using a similar methodology but training larger models on a larger feedback dataset and evaluating on a more difficult task, found segment-level feedback to be significantly more useful. [Scheurer et al. 2023] use language feedback from another LLM that implicitly acts as a reward model for training the policy model.
Ideally, as discussed in Section 4, the reward model provides a dual-purpose reward that takes into account both the task information (e.g., the summarization task) and task-specific evaluation (e.g., a condescending summary is rewarded less than a neutral one). Diversifying the approach, some strategies use multiple reward models, each specializing in distinct characteristics or specific tasks. [Wu et al. 2023c; Ramé et al. 2023] demonstrate the efficacy of training separate reward models for specific attributes such as coherency and factuality. Similarly, [Glaese et al. 2022] introduce two reward models: one for preference and another for rule violation in dialogue generation. They found using two models to be more effective than one, likely because of a smaller feedback dataset. Further, since a preference-based reward model provides a delayed reward (the reward is provided only at the end of the whole trajectory), [Bahdanau et al. 2016], who apply the A2C algorithm to sequence modeling, propose potential-based reward shaping, where intermediate generations (states) are also rewarded.
In conclusion, the reward modeling process is a critical component of RLHF that involves training a model to emulate human feedback, thereby acting as a surrogate for the true reward function. The size, initialization, and generalization capabilities of the reward model are all crucial factors that influence its performance. The reward model should be robust and well-calibrated, and it can additionally provide more fine-grained feedback for policy model training.
The trained reward model is utilized for finetuning the language model. Framing the task as reinforcement learning, with the language model as the policy, algorithms such as Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C) [Schulman et al. 2017; Bahdanau et al. 2016] are used to update the parameters of the language model such that the generated outputs maximize the obtained reward. These gradient-based methods, called policy-gradient algorithms, directly update the parameters of the policy using the evaluative reward feedback. The following sections primarily focus on the widely used Proximal Policy Optimization (PPO) algorithm, though the same concepts apply to other candidate algorithms [Ouyang et al. 2022].
Prior works compare different variants of RL algorithms at a fixed KL distance from the initial model, with the aim of maximizing the reward with the lowest possible KL divergence.
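Concretely, the quantity maximized during RL training typically combines the reward model's score with a per-token KL penalty against the initial (SFT) policy. The sketch below illustrates this combination; the tensor layout and the coefficient value are illustrative assumptions rather than settings from any specific work.

```python
import torch

def kl_penalized_reward(
    reward_model_score: torch.Tensor,  # (batch,) scalar score for each full output
    policy_logprobs: torch.Tensor,     # (batch, seq_len) log pi_RL(token | context)
    ref_logprobs: torch.Tensor,        # (batch, seq_len) log pi_init(token | context)
    response_mask: torch.Tensor,       # (batch, seq_len) 1 on generated tokens
    kl_coef: float = 0.1,              # strength of the KL penalty (illustrative value)
) -> torch.Tensor:
    """Total reward used for the RL update: reward-model score minus a KL penalty
    that keeps the tuned policy close to the initial (SFT) model."""
    per_token_kl = (policy_logprobs - ref_logprobs) * response_mask
    kl_penalty = kl_coef * per_token_kl.sum(dim=-1)
    return reward_model_score - kl_penalty
```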
Limitations in Current RLHF practices Despite the impressive results achieved by RLHF in practice, it is an
unstable training process [Choshen et al. 2019]. Moreover, it is highly sensitive to hyperparameters, necessitating a
significant amount of hyperparameter tuning [Rafailov et al. 2023; Yuan et al. 2023]. Furthermore, the generalization
capabilities of RLHF and other issues, such as underperformance on metrics not captured by the reward model, warrant
further investigation. A comprehensive examination of these aspects is discussed in Section 7.7.
Models fine-tuned using Reinforcement Learning from Human Feedback (RLHF) showcase a remarkable ability to align with human preferences and generalize to new scenarios, and the approach is more sample-efficient than supervised fine-tuning. Nonetheless, these models exhibit characteristics and behaviors that warrant careful consideration, prompting the need for further exploration and refinement.
Alignment Capabilities One intriguing property, referred to as the Alignment Tax, was identified by [Ouyang et al.
2022]. The phenomenon reveals that RLHF-trained chat models sometimes perform poorly compared to the initial policy on downstream tasks, suggesting a cost linked to aligning with human preferences. To mitigate this, they propose incorporating
the pre-training objective into RLHF-finetuning, which substantially reduces the Alignment Tax. Moreover, [Bai et al.
2022a] indicates that larger models tend to exhibit lower alignment tax. [Bai et al. 2022a] also observed that RLHF
models better align with human preferences as the scales of both the reward model and policy model increase. It is
noteworthy, however, that a similar scaling effect could be seen in instruction-finetuned SFT models. A comprehensive
comparison of the scaling effects on RLHF versus SFT models is currently lacking in the literature and would make for
an intriguing future study.
Generalization Capabilities RLHF models have exhibited impressive generalization capabilities beyond their training
data, including generalization on new prompts and human feedback. For instance, [Ouyang et al. 2022] demonstrates
RLHF-tuned models answering coding questions and following instructions in multiple languages despite being
finetuned only in English and with limited code-related prompts. This suggests that the majority of a language model’s
capabilities are acquired during pre-training, and RLHF merely aligns these capabilities to elicit desired behavior.
However, this generalization can be a double-edged sword, potentially leading to undesirable outcomes, especially when the feedback signal is sparse. For instance, the initial LLaMA2 Chat model (https://huggingface.co/meta-llama/Llama-2-70b-chat-hf), when prompted "How to kill a process?", refused to answer, citing ethical concerns, even though the intended answer was about terminating a computer process. This behavior likely stems from the model's over-generalization from examples that trained it to reject violent queries. The example further highlights the problem of imperfect rewards leading to misgeneralization, as discussed in Section 6.2. Further, a distributional shift between prompts used for reward model finetuning and RLHF training can result in the policy model misaligning with human preferences [Bai et al. 2022a]. Further, during RL training, outputs are sampled from the language model and evaluated using the reward model; if the sampling parameters used at inference time deviate from those used during training, results can be poor [Ramamurthy et al. 2022].
Diversity & Biases of RLHF model outputs Another characteristic of RLHF models is their low entropy in output
distribution [Bai et al. 2022a], which challenges generating diverse responses [Kirk et al. 2023]. This holds true for
both seen and unseen datasets. To address this, entropy regularization techniques are introduced [Jaques et al. 2019; Li
et al. 2016] to amplify diversity in the action space, albeit without always resolving the issue [Raichuk et al. 2021]. Although not conclusive, [Bai et al. 2022a] found that while RLHF models exhibit better sentiment towards all classes, they
display similar biases to underlying LLMs when sampling with temperature < 1 (i.e., with low diversity samples). This
could be attributed to their lower entropy. Furthermore, while pre-trained models often generate probabilities that are
well-calibrated, RLHF models may lose this calibration. For instance, [OpenAI 2023] found that for pre-trained GPT-4,
the probability of generating an answer is often directly proportional to the probability of it being correct. However, in
the case of RLHF models, the distribution is skewed towards more likely answers.
Objective Misalignment While RLHF aims to align language models with human preferences and intentions, reward model misalignment is frequently possible. For instance, Singhal et al. [2023] find that reward models assign higher rewards to longer outputs.
Robustness and Safety It is imperative to note that the reward model is merely an imperfect proxy for real human
preferences/feedback. Due to the lack of calibration and robustness of reward models [Bai et al. 2022a], over-optimizing
against the reward model can render it an ineffective measure (Goodhart’s Law). This phenomenon, known as Reward
Overoptimization, has been studied in the context of language models by [Gao et al. 2022; Coste et al. 2023].
Further, training RLHF models in practice is very difficult for practitioners owing to unstable training [Choshen et al. 2019], hyperparameter sensitivity [Yuan et al. 2023; Rafailov et al. 2023], and the need to load multiple models, leading to high memory usage [Santacroce et al. 2023]. As a result, there have been significant efforts to simplify the training process
by learning directly from the available feedback using simpler supervised finetuning objectives, as we discuss in
Section 7.9.
In conclusion, while RLHF substantially enhances the performance of LLMs and aligns them with human preferences,
it is not without its limitations. These include, but are not limited to, issues such as text hallucination [McKenna et al.
2023], bias and toxicity [Deshpande et al. 2023a; Ferrara 2023; Gupta et al. 2023], and the generation of harmful text
when probed [Perez et al. 2022; Wei et al. 2023]. Despite significant improvements, these models are not fully aligned
with human preferences, underscoring the need for continued research and development in this field.
Challenges with Sparse Rewards in Traditional RL Reinforcement learning (RL) has conventionally employed
delayed and sparse rewards, where agents receive scalar feedback at the end of a trajectory or episode [Sutton and
Barto 2005]. While this approach is straightforward to implement and aligns with the task objective, it is not without its
drawbacks. Sparse rewards can lead to sample-inefficient learning due to extensive exploration requirements [Bellemare
et al. 2016]. Additionally, they may result in reward hacking, where agents exploit unintended strategies to maximize
rewards without solving the intended task [Ibarz et al. 2018]. Underspecified rewards, which do not fully capture the
desired behavior, can also yield suboptimal or degenerate solutions [Hadfield-Menell et al. 2017].
Enriching Reward Signals To mitigate the limitations of sparse rewards, researchers have explored various methods
for providing richer feedback in environments with inherently sparse rewards. These approaches include reward
shaping, where the original reward signal is augmented with additional feedback [Ng et al. 1999; Grzes 2017]; intrinsic
motivation, which encourages exploration and learning through internal rewards based on novelty, curiosity, or learning
progress [Oudeyer et al. 2007; Bellemare et al. 2016; Pathak et al. 2017]; and multi-objective optimization with
multiple reward signals [Roijers et al. 2013; Roijers 2016]. Hierarchical RL, which decomposes complex tasks into
simpler subtasks with their own reward structures, has also been investigated [Dietterich 1999; Barto and Mahadevan
2003]. Moreover, richer forms of feedback, such as learning from corrections [Jain et al. 2015; Bajcsy et al. 2017],
demonstrations [Rengarajan et al. 2022], and language feedback [Matuszek et al. 2012; Fried et al. 2017], have proven
beneficial.
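As an example of reward shaping in the sense of Ng et al. [1999], a potential function Φ over states induces the shaping term γΦ(s′) − Φ(s), which densifies sparse rewards without changing the optimal policy; a minimal sketch, with an assumed transition format, is given below.

```python
from typing import Callable, List, Tuple

def shape_rewards(
    transitions: List[Tuple[object, float, object]],  # (state, env_reward, next_state)
    potential: Callable[[object], float],             # Phi(s): a heuristic "progress" estimate
    gamma: float = 0.99,
) -> List[float]:
    """Potential-based reward shaping (Ng et al., 1999): add gamma*Phi(s') - Phi(s)
    to each transition's reward, providing a denser signal while preserving the
    optimal policy of the original task."""
    return [
        env_reward + gamma * potential(next_state) - potential(state)
        for state, env_reward, next_state in transitions
    ]
```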
Implications for RLHF in LLMs Current RLHF pipelines for LLMs primarily rely on sparse rewards provided at
the end of an episode, with reward models trained using sparse preference-based feedback. Similar challenges observed
in traditional RL have also been identified in RLHF-tuned LLMs. Some progress has been made in learning from
feedback for multi-objective optimization [Ramé et al. 2023], language feedback [Scheurer et al. 2022], corrective
feedback [Madaan et al. 2023; Shinn et al. 2023], and denser rewards [Wu et al. 2023c]. Future research should explore
the integration of these techniques to address the unique challenges in training LLMs with RLHF, potentially improving
generalization and robustness.
While RLHF has been very successful, it still results in unstable training [Choshen et al. 2019], is sensitive to hyperparameters [Yuan et al. 2023; Rafailov et al. 2023], and has high memory usage [Santacroce et al. 2023], making it difficult for practitioners to actually use. As a result, there have been significant efforts to simplify the training process by learning directly from the available feedback using simpler supervised finetuning objectives.
Once a reward model is trained, it is not necessary to perform RLHF-based training. Instead, an alternative approach at inference time is to sample multiple outputs from the LLM and rank them using the reward model [Nakano et al. 2021; Cobbe et al. 2021]. This is also called best-of-n sampling or rejection sampling. When sampling multiple outputs, it is important to ensure their diversity by adjusting the sampling parameters (such as using a higher temperature). This approach is often used either as a baseline or in combination with RLHF-trained models for better inference-time results.
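A minimal sketch of best-of-n sampling follows; the generate and reward_model callables are hypothetical stand-ins for the language model's sampling routine and the trained reward model.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str, float], str],      # (prompt, temperature) -> sampled output (hypothetical)
    reward_model: Callable[[str, str], float],  # (prompt, output) -> scalar score (hypothetical)
    n: int = 16,
    temperature: float = 1.0,
) -> str:
    """Best-of-n (rejection) sampling: draw n diverse candidates and keep the one
    the reward model scores highest; no RL update of the policy is required."""
    candidates: List[str] = [generate(prompt, temperature) for _ in range(n)]
    return max(candidates, key=lambda output: reward_model(prompt, output))
```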
Further, various works [Dong et al. 2023; Yuan et al. 2023; Song et al. 2023] use the trained reward model to rank
multiple responses and use the signal from the ranked responses to train the policy model, without using an elaborate
RL algorithm. In another line of work, RAD [Deng and Raffel 2023] uses weighted-decoding of tokens at inference,
based on a separately trained reward model.
In this section, we discuss alternative methods to align language models with human feedback that do not rely on reward
models. While RLHF-PPO has shown promising results, it suffers from sensitivity to hyperparameters, the need for
training additional models, and potential misalignment of the reward model [Wu et al. 2023c; Rafailov et al. 2023;
Pang et al. 2022; Zhu et al. 2023b; Singhal et al. 2023]. To address these issues, recent research has explored various
techniques that directly incorporate human feedback into the training process, without relying on additional reward
models.
A straightforward approach is supervised fine-tuning on positive demonstrations from human feedback, such as
instruction-finetuned models [Ouyang et al. 2022; Chiang et al. 2023; Zhou et al. 2023]. However, this method does not
utilize negative feedback, which is crucial for training robust models that can handle adversarial situations and identify
and correct errors.
Recent works, such as [Liu et al. 2023a; Zhang et al. 2023], provide both positive and negative demonstrations/feedback
and maximize the likelihood of generating positive/preferred output. These methods have shown better performance
than RLHF methods on summarization and dialogue tasks. Zhao et al. [2023] demonstrate that Sequence Likelihood Calibration (SLiC) [Zhao et al. 2022] can be used to train models on off-policy offline data collected from different models, resulting in better performance than RLHF-based methods on summarization tasks. SLiC uses a ranking calibration
loss that contrasts positive and negative sequences while motivating the model to predict the positive class. Further,
RSO [Liu et al. 2023c] improves policy learning in SLiC by using statistical rejection sampling from the policy.
Rafailov et al. [2023]; Azar et al. [2023] further reformulate the objective encoded in the RLHF PPO algorithm and train the model directly on the new objective, without the need for a separate reward model. This follows the intuition that the policy model can implicitly be used as a reward model for training itself based on the collected feedback. However, the results are preliminary, and extending to out-of-distribution prompts may not be possible without the introduction of an explicit reward model.
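One widely used instantiation of this idea is the direct preference optimization (DPO) objective of Rafailov et al. [2023], in which the policy's log-ratio to a frozen reference model acts as an implicit reward. A minimal sketch of that loss is given below; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_preferred: torch.Tensor,  # (batch,) log pi_theta(preferred | prompt)
    policy_logp_rejected: torch.Tensor,   # (batch,) log pi_theta(rejected | prompt)
    ref_logp_preferred: torch.Tensor,     # (batch,) log pi_ref(preferred | prompt)
    ref_logp_rejected: torch.Tensor,      # (batch,) log pi_ref(rejected | prompt)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    """DPO-style objective: the policy's log-ratio to a frozen reference model acts
    as an implicit reward, so no separate reward model is trained."""
    preferred_ratio = policy_logp_preferred - ref_logp_preferred
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (preferred_ratio - rejected_ratio)).mean()
```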
Another line of research focuses on refining model-generated responses using human-encoded principles or feedback.
[Bai et al. 2022b; Kundu et al. 2023] propose a framework where a list of human-encoded principles (Constitution) guide
the model to critique its generations and self-refine the responses. The model is then fine-tuned on the refined responses.
Self-Align [Sun et al. 2023] follows a similar procedure but further removes the need to start with an RLHF-finetuned
model. They fine-tune the pretrained LLaMA [Touvron et al. 2023a] base model using less than 300 lines of human
feedback (in the form of constitutional principles) and achieve performance comparable to state-of-the-art models in
terms of helpfulness and harmlessness.
Another direction of work learns to generate or select good feedback for model outputs and apply it to refine language
model outputs. [Scheurer et al. 2022] takes a similar refinement approach but utilizes available summarization feedback.
The initial model is conditioned on input, feedback, and output, generating multiple refinements. The model is then
fine-tuned on refinements with the highest similarity to human feedback. [Liu et al. 2023b] aligns human moral values
by modeling DP (dynamic-programming) based edits from unaligned source text to target aligned text. The model is
then fine-tuned on the refinements generated by the edits, using RL for the second part of the process. [Xu et al. 2022]
fine-tune a dialogue model using multi-modal feedback with the DIRECTOR method [Arora et al. 2022], which models
both negative and positive sequence labeling directly in the language model head.
In summary, these alternative methods generate new data based on feedback or guidelines and then use it to fine-tune the model. These approaches reduce the reliance on reward models and have shown promising results in some tasks, making them a viable alternative to RLHF-PPO. While these models are easier to train and help alleviate many drawbacks of RLHF, their evaluation has so far been limited to specific domains and constrained settings. Moreover, other in-depth analyses, such as sample efficiency and the properties exhibited by these models, especially on out-of-distribution data, need to be explored further.
Acknowledgements We thank Khanh Nguyen for extensive and insightful feedback on earlier versions of the draft.
We also thank Wenlong Zhao, Tuhina Tripathi, and Abhiman Neelakanteswara for their help with improving the clarity
of the manuscript.
References
Pranjal Aggarwal, A. Deshpande, and Karthik Narasimhan. Semsup-xc: Semantic supervision for zero and few-
shot extreme classification. In International Conference on Machine Learning, 2023. URL https://api.
semanticscholar.org/CorpusID:256274863.
Anirudh Ajith, Chris Pan, Mengzhou Xia, A. Deshpande, and Karthik Narasimhan. Instructeval: Systematic evaluation
of instruction selection methods. ArXiv, abs/2307.00259, 2023. URL https://api.semanticscholar.org/
CorpusID:259316853.
Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune,
Ali Siahkoohi, and Richard G. Baraniuk. Self-consuming generative models go mad, 2023.
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh
Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information
processing systems, 30, 2017.
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Tachard Passos, Siamak
Shakeri, Emanuel Taropa, Paige Bailey, Z. Chen, Eric Chu, J. Clark, Laurent El Shafey, Yanping Huang, Kathleen S.
Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan
Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernández Ábrego, Junwhan Ahn, Jacob Austin, Paul Barham,
Jan A. Botha, James Bradbury, Siddhartha Brahma, Kevin Michael Brooks, Michele Catasta, Yongzhou Cheng,
Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, C Crépy, Shachi Dave, Mostafa Dehghani,
Sunipa Dev, Jacob Devlin, M. C. Díaz, Nan Du, Ethan Dyer, Vladimir Feinberg, Fan Feng, Vlad Fienber, Markus
Freitag, Xavier García, Sebastian Gehrmann, Lucas González, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou,
Joshua Howland, An Ren Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski,
Wen Hao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric
Li, Mu-Li Li, Wei Li, Yaguang Li, Jian Li, Hyeontaek Lim, Han Lin, Zhong-Zhong Liu, Frederick Liu, Marcello
Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric
Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily
Reif, Bryan Richter, Parker Riley, Alexandra Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Marie Shelby,
Ambrose Slone, Daniel Smilkov, David R. So, Daniela Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan,
Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Ke Xu, Yu Xu,
Lin Wu Xue, Pengcheng Yin, Jiahui Yu, Qiaoling Zhang, Steven Zheng, Ce Zheng, Wei Zhou, Denny Zhou, Slav
Petrov, and Yonghui Wu. Palm 2 technical report. 2023.
Kushal Arora, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. Director: Generator-classifiers for supervised
language modeling. In AACL, 2022.
Saurabh Arora and Prashant Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress.
Artificial Intelligence, 297:103500, 2021.
Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. Deep reinforcement learning: A
brief survey. IEEE Signal Processing Magazine, 34:26–38, 2017. URL https://api.semanticscholar.org/
CorpusID:4884302.
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, T. J. Henighan, Andy Jones, Nicholas Joseph,
Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, John Kernion, Kamal
Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, and
Jared Kaplan. A general language assistant as a laboratory for alignment. ArXiv, abs/2112.00861, 2021.
Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi
Munos. A general theoretical paradigm to understand learning from human preferences, 2023.
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and
Yoshua Bengio. An actor-critic algorithm for sequence prediction. ArXiv, abs/1607.07086, 2016.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort,
Deep Ganguli, T. J. Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk,
Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt,
Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah,
Benjamin Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from
human feedback. ArXiv, abs/2204.05862, 2022a.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, Anna Chen, Anna Goldie,
Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn
Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E Perez, Jamie Kerr, Jared Mueller, Jeff Ladish, J Landau, Kamal
Ndousse, Kamilė Lukosiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado,
Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk,
Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, T. J. Henighan, Tristan Hume, Sam Bowman,
Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, and Jared
Kaplan. Constitutional ai: Harmlessness from ai feedback. ArXiv, abs/2212.08073, 2022b.
Alexandre Bailly, Corentin Blanc, Élie Francis, Thierry Guillotin, Fadi Jamal, Béchara Wakim, and Pascal Roy. Effects
of dataset size and interactions on the prediction performance of logistic regression and deep learning models.
Computer Methods and Programs in Biomedicine, 213:106504, 2022.
Andrea V. Bajcsy, Dylan P. Losey, Marcia Kilchenman O’Malley, and Anca D. Dragan. Learning robot objectives
from physical human interaction. In Conference on Robot Learning, 2017. URL https://api.semanticscholar.
org/CorpusID:28406224.
Peter Barnett, Rachel Freedman, Justin Svegliato, and Stuart Russell. Active reward learning from multiple teachers.
arXiv preprint arXiv:2303.00894, 2023.
Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event
Dynamic Systems, 13:41–77, 2003. URL https://api.semanticscholar.org/CorpusID:386824.
Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying
count-based exploration and intrinsic motivation. In NIPS, 2016. URL https://api.semanticscholar.org/
CorpusID:8310565.
Dimitri Bertsekas and John N Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.
Christopher Bishop. Pattern recognition and machine learning. Springer, 2006.
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement
learning. ArXiv, abs/2305.13301, 2023.
Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired
comparisons. Biometrika, 39:324, 1952.
Daniel S Brown and Scott Niekum. Deep bayesian reward learning from preferences. arXiv preprint arXiv:1912.04472,
2019.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J.
Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark
Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. ArXiv, abs/2005.14165,
2020.
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman,
Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll,
Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max
Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem
Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell. Open problems and fundamental
limitations of reinforcement learning from human feedback, 2023.
Angelica Chen. Improving code generation by training with natural language feedback. ArXiv, abs/2303.16749, 2023.
URL https://api.semanticscholar.org/CorpusID:257804798.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao
Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with
90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Leshem Choshen, Lior Fox, Zohar Aizenbud, and Omri Abend. On the weaknesses of reinforcement learning for neural
machine translation. ArXiv, abs/1907.01752, 2019.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua
Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du,
Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin,
Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra,
Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander
Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankara-
narayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee,
Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S.
Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with
pathways. ArXiv, abs/2204.02311, 2022.
Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement
learning from human preferences. ArXiv, abs/1706.03741, 2017.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry
Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math
word problems. ArXiv, abs/2110.14168, 2021.
Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimiza-
tion, 2023.
Haikang Deng and Colin Raffel. Reward-augmented decoding: Efficient controlled text generation with a unidirectional
reward model, 2023.
A. Deshpande, Vishvak Murahari, Tanmay Rajpurohit, A. Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Ana-
lyzing persona-assigned language models. ArXiv, abs/2304.05335, 2023a. URL https://api.semanticscholar.
org/CorpusID:258060002.
Ameet Deshpande, Tanmay Rajpurohit, Karthik Narasimhan, and Ashwin Kalyan. Anthropomorphization of ai:
Opportunities and risks. arXiv preprint arXiv:2305.14784, 2023b.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. ArXiv, abs/1810.04805, 2019.
Thomas G. Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. ArXiv,
cs.LG/9905014, 1999. URL https://api.semanticscholar.org/CorpusID:57341.
Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and T. Zhang. Raft:
Reward ranked finetuning for generative foundation model alignment. ArXiv, abs/2304.06767, 2023.
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi
Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma
Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen S. Meier-Hellstern, Toju Duke, Lucas Dixon, Kun
Zhang, Quoc V. Le, Yonghui Wu, Z. Chen, and Claire Cui. Glam: Efficient scaling of language models with
mixture-of-experts. ArXiv, abs/2112.06905, 2021.
Esin Durmus, Karina Nguyen, Thomas Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac
Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin,
Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. Towards measuring the representation of subjective
global opinions in language models. ArXiv, abs/2306.16388, 2023. URL https://api.semanticscholar.org/
CorpusID:259275051.
Xiang Fan, Yiwei Lyu, Paul Pu Liang, Ruslan Salakhutdinov, and Louis-Philippe Morency. Nano: Nested human-in-
the-loop reward learning for few-shot language model control. ArXiv, abs/2211.05750, 2022.
Patrick Fernandes, António Farinhas, Ricardo Rei, José G. C. de Souza, Perez Ogayo, Graham Neubig, and André F. T.
Martins. Quality-aware decoding for neural machine translation. ArXiv, abs/2205.00978, 2022.
Patrick Fernandes, Aman Madaan, Emmy Liu, António Farinhas, Pedro Henrique Martins, Amanda Bertsch, José
G. C. de Souza, Shuyan Zhou, Tongshuang Sherry Wu, Graham Neubig, and André F. T. Martins. Bridging the gap:
A survey on integrating (human) feedback for natural language generation. ArXiv, abs/2305.00955, 2023. URL
https://api.semanticscholar.org/CorpusID:258426970.
Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. ArXiv, abs/2304.03738,
2023. URL https://api.semanticscholar.org/CorpusID:258041203.
Daniel Fried, Jacob Andreas, and Dan Klein. Unified pragmatic models for generating and following instruc-
tions. In North American Chapter of the Association for Computational Linguistics, 2017. URL https:
//api.semanticscholar.org/CorpusID:21015570.
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Benjamin Mann, Ethan
Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma,
Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zachary Dodds, T. J. Henighan, Danny Hernandez,
Tristan Hume, Josh Jacobson, Scott Johnston, Jared Kaplan,
and Jack Clark. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.
ArXiv, abs/2209.07858, 2022. URL https://api.semanticscholar.org/CorpusID:252355458.
Ge Gao, Hung-Ting Chen, Yoav Artzi, and Eunsol Choi. Continually improving extractive qa via human feedback.
ArXiv, abs/2305.12473, 2023.
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. ArXiv, abs/2210.10760,
2022.
Amelia Glaese, Nathan McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh,
Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang,
Ramona Comanescu, Fan Yang, A. See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez
Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel,
William S. Isaac, John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving.
Improving alignment of dialogue agents via targeted human judgements. ArXiv, abs/2209.14375, 2022.
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. Continuous measurement scales in human
evaluation of machine translation. In LAW@ACL, 2013.
Marek Grzes. Reward shaping in episodic reinforcement learning. In Adaptive Agents and Multi-Agent Systems, 2017.
URL https://api.semanticscholar.org/CorpusID:2093019.
Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song.
The false promise of imitating proprietary llms, 2023.
Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, and Tushar
Khot. Bias runs deep: Implicit reasoning biases in persona-assigned llms. arXiv preprint arXiv:2311.04892, 2023.
Dylan Hadfield-Menell, Smitha Milli, P. Abbeel, Stuart J. Russell, and Anca D. Dragan. Inverse reward design. ArXiv,
abs/1711.02827, 2017. URL https://api.semanticscholar.org/CorpusID:3805733.
Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazaré, and Jason Weston. Learning from dialogue after
deployment: Feed yourself, chatbot! In Annual Meeting of the Association for Computational Linguistics, 2019.
Austin W. Hanjie, Ameet Deshpande, and Karthik Narasimhan. Semsup: Semantic supervision for simple and scalable
zero-shot generalization. 2022. URL https://api.semanticscholar.org/CorpusID:255595954.
Danny Hernandez, Tom B. Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac
Hatfield-Dodds, T. J. Henighan, Tristan Hume, Scott Johnston, Benjamin Mann, Christopher Olah, Catherine Olsson,
Dario Amodei, Nicholas Joseph, Jared Kaplan, and Sam McCandlish. Scaling laws and interpretability of learning
from repeated data. ArXiv, abs/2205.10487, 2022.
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. ArXiv,
abs/1503.02531, 2015.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego
de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican,
George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W.
Rae, Oriol Vinyals, and L. Sifre. Training compute-optimal large language models. ArXiv, abs/2203.15556, 2022.
Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human
preferences and demonstrations in atari. ArXiv, abs/1811.06521, 2018. URL https://api.semanticscholar.
org/CorpusID:53424488.
Ashesh Jain, Shikhar Sharma, Thorsten Joachims, and Ashutosh Saxena. Learning preferences for manipulation tasks
from online coactive feedback. The International Journal of Robotics Research, 34:1296 – 1313, 2015. URL
https://api.semanticscholar.org/CorpusID:10851113.
Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Àgata Lapedriza, Noah J. Jones, Shix-
iang Shane Gu, and Rosalind W. Picard. Way off-policy batch deep reinforcement learning of implicit human
preferences in dialog. ArXiv, abs/1907.00456, 2019.
Adam Tauman Kalai and Santosh S Vempala. Calibrated language models must hallucinate. arXiv preprint
arXiv:2311.14648, 2023.
Jared Kaplan, Sam McCandlish, T. J. Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec
Radford, Jeff Wu, and Dario Amodei. Scaling laws for neural language models. ArXiv, abs/2001.08361, 2020.
Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier. A survey of reinforcement learning from human
feedback. arXiv preprint arXiv:2312.14925, 2023.
Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy. Deep reinforcement learning for sequence-
to-sequence models. IEEE Transactions on Neural Networks and Learning Systems, 31:2469–2489, 2018.
Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. Sample efficient text summarization using a single
pre-trained transformer. ArXiv, abs/1905.08836, 2019.
Samuel Kiegeland and Julia Kreutzer. Revisiting the weaknesses of reinforcement learning for neural machine
translation. ArXiv, abs/2106.08942, 2021.
Sungdong Kim, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Min Yoo, and Minjoon Seo.
Aligning large language models through synthetic feedback. ArXiv, abs/2305.13735, 2023.
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and
Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity, 2023.
W. B. Knox and P. Stone. Tamer: Training an agent manually via evaluative reinforcement. 2008 7th IEEE International
Conference on Development and Learning, pages 292–297, 2008.
W Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, and Alessandro Allievi. Models
of human preference for learning reward functions. arXiv preprint arXiv:2206.02231, 2022.
Tomasz Korbak, Ethan Perez, and Christopher L. Buckley. Rl with kl penalties is better viewed as bayesian inference.
ArXiv, abs/2205.11275, 2022.
Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Sam Bowman, and
Ethan Perez. Pretraining language models with human preferences. ArXiv, abs/2302.08582, 2023.
Julia Kreutzer, Shahram Khadivi, Evgeny Matusov, and Stefan Riezler. Can neural machine translation be improved
with user feedback? ArXiv, abs/1804.05958, 2018a.
Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. Reliability and learnability of human bandit feedback for sequence-
to-sequence reinforcement learning. ArXiv, abs/1805.10627, 2018b.
Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital
Balwit, Azalia Mirhoseini, Brayden McLean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus,
Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen, Nelson Elhage, Newton Cheng, Nicholas
Schiefer, Nova DasSarma, Oliver Rausch, Robin Larson, Shannon Yang, Shauna Kravec, Timothy Telleen-Lawton,
Thomas I. Liao, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds, Sören Mindermann, Nicholas Joseph, Sam
McCandlish, and Jared Kaplan. Specific versus general principles for constitutional ai, 2023.
François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. arXiv
preprint arXiv:2109.04838, 2021.
Carolin (Haas) Lawrence and Stefan Riezler. Improving a neural semantic parser by counterfactual learning from
human bandit feedback. In Annual Meeting of the Association for Computational Linguistics, 2018.
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas
Carlini. Deduplicating training data makes language models better. In Annual Meeting of the Association for
Computational Linguistics, 2021.
Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? end-to-end learning
of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing, pages 2443–2453, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
doi:10.18653/v1/D17-1259. URL https://aclanthology.org/D17-1259.
Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston. Dialogue learning with
human-in-the-loop. ArXiv, abs/1611.09823, 2016.
Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. Paraphrase generation with deep reinforcement learning. ArXiv,
abs/1711.00279, 2017.
Zichao Li, Prakhar Sharma, Xing Han Lu, Jackie Chi Kit Cheung, and Siva Reddy. Using interactive feedback to
improve the accuracy and explainability of question answering systems post-deployment. ArXiv, abs/2204.03025,
2022. URL https://api.semanticscholar.org/CorpusID:248006299.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman,
Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. ArXiv, abs/2305.20050, 2023.
Rensis Likert. A technique for the measurement of attitudes. Archives of psychology, 1932.
Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest common
subsequence and skip-bigram statistics. In Annual Meeting of the Association for Computational Linguistics, 2004.
Bing Liu, Gökhan Tür, Dilek Z. Hakkani-Tür, Pararth Shah, and Larry Heck. Dialogue learning with human teaching
and feedback in end-to-end trainable task-oriented dialogue systems. In North American Chapter of the Association
for Computational Linguistics, 2018.
Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. How not to evaluate
your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation.
ArXiv, abs/1603.08023, 2016.
Hao Liu, Carmelo Sferrazza, and P. Abbeel. Chain of hindsight aligns language models with feedback. ArXiv,
abs/2302.02676, 2023a.
Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X. Liu, and Soroush Vosoughi. Second thoughts are best:
Learning to re-align with human values from text edits. ArXiv, abs/2301.00355, 2023b.
Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. Statistical
rejection sampling improves preference optimization, 2023c.
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval, 3(3):
225–331, 2009.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer,
and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
R. Duncan Luce. Individual choice behavior: A theoretical analysis. 1979.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shri-
mai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh,
and Peter Clark. Self-refine: Iterative refinement with self-feedback. ArXiv, abs/2303.17651, 2023.
Andrei Andreevich Markov. The theory of algorithms. Trudy Matematicheskogo Instituta Imeni VA Steklova, 42:3–375,
1954.
Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. A joint model of language
and perception for grounded attribute learning. In International Conference on Machine Learning, 2012. URL
https://api.semanticscholar.org/CorpusID:2408319.
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. On faithfulness and factuality in abstractive
summarization. ArXiv, abs/2005.00661, 2020.
Daniel McFadden. Econometric models of probabilistic choice. 1981.
Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. Sources
of hallucination by large language models on inference tasks. ArXiv, abs/2305.14552, 2023. URL https://api.
semanticscholar.org/CorpusID:258865517.
Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, and Dale Schuurmans. The role of baselines
in policy gradient optimization. Advances in Neural Information Processing Systems, 35:17818–17830, 2022.
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah
Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nathan McAleese. Teaching language models to support
answers with verified quotes. ArXiv, abs/2203.11147, 2022.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on
machine learning, pages 1928–1937. PMLR, 2016.
Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel
Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev,
Olivier Bachem, Daniel Jaymin Mankowitz, Doina Precup, and Bilal Piot. Nash learning from human feedback.
ArXiv, abs/2312.00886, 2023. URL https://api.semanticscholar.org/CorpusID:265609682.
Vishvak Murahari, Carlos E Jimenez, Runzhe Yang, and Karthik R Narasimhan. DataMUX: Data multiplexing
for neural networks. In Thirty-Sixth Conference on Neural Information Processing Systems, 2022. URL https:
//openreview.net/forum?id=UdgtTVTdswg.
Vishvak Murahari, Ameet Deshpande, Carlos E Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, and Karthik
Narasimhan. Mux-plms: Pre-training language models with data multiplexing. arXiv preprint arXiv:2302.12441,
2023.
Reiichiro Nakano, Jacob Hilton, S. Arun Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu
Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button,
Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human
feedback. ArXiv, abs/2112.09332, 2021.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, et al. Deep double descent: Where bigger models and more data hurt. ArXiv, abs/1912.02292, 2019.
A. Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to
reward shaping. In International Conference on Machine Learning, 1999. URL https://api.semanticscholar.
org/CorpusID:5730166.
Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. arXiv
preprint arXiv:2209.00626, 2022.
Duy-Hung Nguyen, Nguyen-Viet-Dung Nghiem, Bao-Sinh Nguyen, Dung Tien Le, Shahab Sabahi, Minh Le Nguyen,
and Hung Le. Make the most of prior data: A solution for interactive text summarization with preference feedback.
In NAACL-HLT, 2022.
Khanh Nguyen, Hal Daumé, and Jordan L. Boyd-Graber. Reinforcement learning for bandit neural machine translation
with simulated human feedback. ArXiv, abs/1707.07402, 2017.
Khanh Nguyen, Dipendra Misra, Robert Schapire, Miro Dudík, and Patrick Shafto. Interactive learning from activity
description. In ICML, 2021.
Marcus O’Connor. Models of human behaviour and confidence in judgement: A review. International Journal of
Forecasting, 5:159–169, 1989.
OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2022.
OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
Pierre-Yves Oudeyer, F. Kaplan, and Verena V. Hafner. Intrinsic motivation systems for autonomous mental develop-
ment. IEEE Trans. Evol. Comput., 11:265–286, 2007. URL https://api.semanticscholar.org/CorpusID:
260429077.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to
follow instructions with human feedback. ArXiv, abs/2203.02155, 2022.
Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, and He He. Reward gaming in
conditional text generation. ArXiv, abs/2211.08714, 2022.
Mihir Parmar, Swaroop Mishra, Mor Geva, and Chitta Baral. Don’t blame the annotator: Bias already starts in the
annotation instructions. In Conference of the European Chapter of the Association for Computational Linguistics,
2022.
Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised
prediction. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages
488–489, 2017. URL https://api.semanticscholar.org/CorpusID:20045336.
Andi Peng, Besmira Nushi, Emre Kiciman, Kori Inkpen, and Ece Kamar. Investigations of performance and bias in
human-ai teamwork in hiring, 2022.
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nathan McAleese,
and Geoffrey Irving. Red teaming language models with language models. In Conference on Empirical Methods in
Natural Language Processing, 2022.
Robin L. Plackett. The analysis of permutations. Journal of The Royal Statistical Society Series C-applied Statistics,
24:193–202, 1975.
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information
processing systems, 1, 1988.
Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah
Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard
Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes
Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John F. J. Mellor, Irina Higgins, Antonia Creswell,
Nathan McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme
Sutherland, Karen Simonyan, Michela Paganini, L. Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro,
Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau,
Maria Tsimpoukelli, N. K. Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Tobias Pohlen, Zhitao Gong,
Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan
Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew G. Johnson, Blake A. Hechtman,
Laura Weidinger, Iason Gabriel, William S. Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer,
Oriol Vinyals, Kareem W. Ayoub, Jeff Stanway, L. L. Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey
Irving. Scaling language models: Methods, analysis & insights from training gopher. ArXiv, abs/2112.11446, 2021.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct
preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv,
abs/1910.10683, 2019.
Anton Raichuk, Piotr Stanczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu
Geist, Olivier Pietquin, Marcin Michalski, and Sylvain Gelly. What matters for on-policy deep actor-critic
methods? a large-scale study. In International Conference on Learning Representations, 2021. URL https:
//api.semanticscholar.org/CorpusID:233340556.
Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591,
2007.
Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage,
Hannaneh Hajishirzi, and Yejin Choi. Is reinforcement learning (not) for natural language processing?: Benchmarks,
baselines, and building blocks for natural language policy optimization. ArXiv, abs/2210.01241, 2022.
Alexandre Ramé, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, and
Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse
rewards. 2023.
Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent
neural networks. CoRR, abs/1511.06732, 2015.
Desik Rengarajan, Gargi Nikhil Vaidya, Akshay Sarvesh, Dileep M. Kalathil, and Srinivas Shakkottai. Reinforcement
learning with sparse rewards using guidance from offline demonstration. ArXiv, abs/2202.04628, 2022. URL
https://api.semanticscholar.org/CorpusID:246679865.
Diederik M. Roijers. Multi-objective decision-theoretic planning. 2016. URL https://api.semanticscholar.
org/CorpusID:124195290.
Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequen-
tial decision-making. ArXiv, abs/1402.0590, 2013. URL https://api.semanticscholar.org/CorpusID:
14478191.
Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, and Yelong Shen. Efficient rlhf: Reducing the memory usage of
ppo, 2023.
Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do
language models reflect?, 2023.
Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W Rocks, Ila Rani
Fiete, and Oluwasanmi Koyejo. Double descent demystified: Identifying, interpreting & ablating the sources of a
deep learning puzzle. arXiv preprint arXiv:2303.14151, 2023.
Jérémy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training
language models with language feedback. 2022.
Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan
Perez. Training language models with language feedback at scale. ArXiv, abs/2303.16755, 2023.
Natalie Schluter. The limits of automatic summarisation according to rouge. In Conference of the European Chapter of
the Association for Computational Linguistics, 2017.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization
algorithms. ArXiv, abs/1707.06347, 2017.
Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. Bleurt: Learning robust metrics for text generation. In Annual
Meeting of the Association for Computational Linguistics, 2020.
Shiqi Shen, Yong Cheng, Zhongjun He, W. He, Hua Wu, Maosong Sun, and Yang Liu. Minimum risk training for
neural machine translation. ArXiv, abs/1512.02433, 2015.
Zhan Shi, Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. Toward diverse text generation with inverse reinforcement
learning. In International Joint Conference on Artificial Intelligence, 2018.
Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Lan-
guage agents with verbal reinforcement learning. 2023. URL https://api.semanticscholar.org/CorpusID:
258833055.
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of
recursion: Training on generated data makes models forget, 2023.
David Silver, Satinder Singh, Doina Precup, and Richard S Sutton. Reward is enough. Artificial Intelligence, 299:
103535, 2021.
K. Singhal, Shekoofeh Azizi, Tao Tu, Said Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Kumar
Tanwani, Heather J. Cole-Lewis, Stephen J. Pfohl, P A Payne, Martin G. Seneviratne, Paul Gamble, Chris Kelly,
Nathanael Schärli, Aakanksha Chowdhery, P. A. Mansfield, Blaise Agüera y Arcas, Dale R. Webster, Greg S. Corrado,
Yossi Matias, Katherine Hui-Ling Chou, Juraj Gottweis, Nenad Tomašev, Yun Liu, Alvin Rajkomar, Joelle K.
Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. Large language models encode clinical
knowledge. Nature, 620:172 – 180, 2022. URL https://api.semanticscholar.org/CorpusID:255124952.
Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in
rlhf, 2023.
Artem Sokolov, Stefan Riezler, and Tanguy Urvoy. Bandit structured prediction for learning from partial feedback in
statistical machine translation. ArXiv, abs/1601.04468, 2016.
Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking
optimization for human alignment, 2023.
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei,
and Paul Christiano. Learning to summarize from human feedback. ArXiv, abs/2009.01325, 2020.
Yushan Su, Vishvak Murahari, Karthik Narasimhan, and Kai Li. Prumux: Augmenting data multiplexing with model
compression. arXiv preprint arXiv:2305.14706, 2023.
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David D. Cox, Yiming Yang, and Chuang
Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. ArXiv,
abs/2305.03047, 2023.
Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. IEEE Transactions on Neural
Networks, 16:285–286, 2005. URL https://api.semanticscholar.org/CorpusID:9166388.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement
learning with function approximation. Advances in neural information processing systems, 12, 1999.
R.S. Sutton. The reward hypothesis. 2004. URL http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/
rewardhypothesis.html.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste
Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and
Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya
Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj
Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor
Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut
Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra,
Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan
Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan,
Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien
Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat
models. 2023b.
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, L. Wang, Antonia Creswell, Geoffrey
Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. ArXiv,
abs/2211.14275, 2022.
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and
Illia Polosukhin. Attention is all you need. In NIPS, 2017.
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? ArXiv,
abs/2307.02483, 2023. URL https://api.semanticscholar.org/CorpusID:259342528.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and
Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2021.
Lilian Weng. Policy gradient algorithms. lilianweng.github.io, 2018. URL https://lilianweng.github.io/
posts/2018-04-08-policy-gradient/.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine
learning, 8:229–256, 1992.
Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Francis Christiano.
Recursively summarizing books with human feedback. ArXiv, abs/2109.10862, 2021.
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur,
David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. ArXiv, abs/2303.17564,
2023a. URL https://api.semanticscholar.org/CorpusID:257833842.
Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, and Jiantao Jiao. Pairwise proximal
policy optimization: Harnessing relative feedback for llm alignment. ArXiv, abs/2310.00212, 2023b. URL https:
//api.semanticscholar.org/CorpusID:263334045.
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf,
and Hanna Hajishirzi. Fine-grained human feedback gives better rewards for language model training. 2023c.
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf,
and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. arXiv
preprint arXiv:2306.01693, 2023d.
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning, 2023.
Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. In Association
for Computational Linguistics (ACL), 2022.
Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma,
and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. 2023.
Jing Xu, Megan Ung, Mojtaba Komeili, Kushal Arora, Y-Lan Boureau, and Jason Weston. Learning new skills after
deployment: Improving open-domain internet-driven dialogue with human feedback. ArXiv, abs/2208.03270, 2022.
Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martinez-Maldonado, Guanliang Chen, Xinyu Li, Yueqiao
Jin, and Dragan Gašević. Practical and ethical challenges of large language models in education: A systematic
scoping review. British Journal of Educational Technology, August 2023. ISSN 1467-8535. doi:10.1111/bjet.13370.
URL http://dx.doi.org/10.1111/bjet.13370.
Kevin Yang, Nanyun Peng, Yuandong Tian, and Dan Klein. Re3: Generating longer stories with recursive reprompting
and revision. In Conference on Empirical Methods in Natural Language Processing, 2022a.
Ziqing Yang, Yiming Cui, and Zhigang Chen. TextPruner: A model pruning toolkit for pre-trained language models. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations,
pages 35–43, Dublin, Ireland, May 2022b. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-
demo.4. URL https://aclanthology.org/2022.acl-demo.4.
Georgios N. Yannakakis and John Hallam. Ranking vs. preference: A comparative study of self-reporting. In Affective
Computing and Intelligent Interaction, 2011. URL https://api.semanticscholar.org/CorpusID:48790.
Sanghyun Yi, Rahul Goel, Chandra Khatri, Tagyoung Chung, Behnam Hedayatnia, Anu Venkatesh, Raefer Gabriel,
and Dilek Z. Hakkani-Tür. Towards coherent and engaging spoken dialog response generation using automatic
conversation evaluators. In International Conference on Natural Language Generation, 2019.
Yichun Yin, Cheng Chen, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. AutoTinyBERT: Automatic hyper-parameter optimization for efficient pre-trained language models. In Annual Meeting of the Association for Computational Linguistics, pages 5146–5157, 2021.
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Feiran Huang. Rrhf: Rank responses to
align language models with human feedback without tears. ArXiv, abs/2304.05302, 2023.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona
Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh
Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models.
ArXiv, abs/2205.01068, 2022.
Tianjun Zhang, Fangchen Liu, Justin Wong, P. Abbeel, and Joseph Gonzalez. The wisdom of hindsight makes language
models better instruction followers. ArXiv, abs/2302.05206, 2023.
Yao Zhao, Misha Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J. Liu. Calibrating sequence
likelihood improves conditional language generation. ArXiv, abs/2210.00045, 2022.
Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood
calibration with human feedback. ArXiv, abs/2305.10425, 2023.
Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Bing Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Luyao
Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wen-De Lai, Minghao Zhu, Rongxiang Weng, Wen-Chun Cheng, Cheng
Chang, Zhangyue Yin, Yuan Long Hua, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu,
and Xuanjing Huang. Secrets of rlhf in large language models part i: Ppo. 2023.
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu,
Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. Lima: Less is more for alignment, 2023.
Wangchunshu Zhou and Ke Xu. Learning to compare for better training and evaluation of open domain natural language
generation models. In AAAI Conference on Artificial Intelligence, 2020.
Banghua Zhu, Jiantao Jiao, and M.I. Jordan. Principled reinforcement learning with human feedback from pairwise or
k-wise comparisons. ArXiv, abs/2301.11270, 2023a.
Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, M.I. Jordan, and Jiantao Jiao.
Fine-tuning language models with advantage-induced policy alignment. ArXiv, abs/2306.02231, 2023b.
Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement
learning. In AAAI Conference on Artificial Intelligence, 2008.
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey
Irving. Fine-tuning language models from human preferences. ArXiv, abs/1909.08593, 2019.