

RLHF: Reinforcement Learning from Human Feedback

May 2, 2023 • Chip Huyen

[LinkedIn discussion, Twitter thread]

In literature discussing why ChatGPT is able to capture so much of our imagination, I often
come across two narratives:

1. Scale: throwing more data and compute at it.


2. UX: moving from a prompt interface to a more natural chat interface.

One narrative that is often glossed over is the incredible technical creativity that went into
making models like ChatGPT work. One such cool idea is RLHF (Reinforcement Learning from
Human Feedback): incorporating reinforcement learning and human feedback into NLP.

RL has been notoriously difficult to work with, and therefore, mostly confined to gaming and
simulated environments like Atari or MuJoCo. Just five years ago, both RL and NLP were
progressing pretty much orthogonally – different stacks, different techniques, and different
experimentation setups. It’s impressive to see it work in a new domain at a massive scale.

So, how exactly does RLHF work? Why does it work? This post will discuss the answers to those
questions.

To understand RLHF, we first need to understand the process of training a model like ChatGPT
and where RLHF fits in, which is the focus of the first section of this post. The following 3
sections cover the 3 phases of ChatGPT development. For each phase, I’ll discuss the goal for
that phase, the intuition for why this phase is needed, and the corresponding mathematical
formulation for those who want to see more technical detail.

Currently, RLHF is not yet widely used in the industry except for a few big key players –
OpenAI, DeepMind, and Anthropic. However, I’ve seen many work-in-progress efforts using
RLHF, so I wouldn’t be surprised to see RLHF used more in the future.

In this post, I assume that readers don’t have specialized knowledge in NLP or RL. If you do, feel
free to skip any section that is less relevant for you.

RLHF overview
Let’s visualize the development process for ChatGPT to see where RLHF fits in.

If you squint, the diagram above looks very similar to the meme Shoggoth with a smiley face.

1. The pretrained model is an untamed monster because it was trained on indiscriminate data scraped from the Internet: think clickbait, misinformation, propaganda, conspiracy theories, or attacks against certain demographics.
2. This monster was then finetuned on higher quality data – think StackOverflow, Quora, or
human annotations – which makes it somewhat socially acceptable.
3. Then the finetuned model was further polished using RLHF to make it customer-
appropriate, e.g. giving it a smiley face.


Shoggoth with Smiley Face. Courtesy of twitter.com/anthrupad

You can skip any of the three phases. For example, you can do RLHF directly on top of the
pretrained model, without going through the SFT phase. However, empirically, combining all
these three steps gives the best performance.

Pretraining is the most resource-intensive phase. For the InstructGPT model, pretraining takes
up 98% of the overall compute and data resources. You can think of SFT and RLHF as unlocking
the capabilities that the pretrained model already has but are hard for users to access via
prompting alone.

Teaching machines to learn from human preferences is not new. It’s been around for over a
decade. OpenAI started exploring learning from human preference back when their main bet
was robotics. The narrative at the time was that human preference was crucial for AI safety. However,
as it turned out, human preference can also make for better products, which attracted a much
larger audience.

»»Side note: The abstract from OpenAI’s learning from human preference paper in 2017««

One step towards building safe AI systems is to remove the need for humans to write goal
functions, since using a simple proxy for a complex goal, or getting the complex goal a bit
wrong, can lead to undesirable and even dangerous behavior. In collaboration with

DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by
being told which of two proposed behaviors is better.

Phase 1. Pretraining for completion


The result of the pretraining phase is a large language model (LLM), often known as the
pretrained model. Examples include GPT-x (OpenAI), Gopher (DeepMind), LLaMa (Meta),
StableLM (Stability AI).

Language model
A language model encodes statistical information about language. For simplicity, statistical
information tells us how likely something (e.g. a word, a character) is to appear in a given
context. The term token can refer to a word, a character, or a part of a word (like -tion ),
depending on the language model. You can think of tokens as the vocabulary that a language
model uses.

Fluent speakers of a language subconsciously have statistical knowledge of that language. For
example, given the context My favorite color is __ , if you speak English, you know that
the word in the blank is much more likely to be green than car .

Similarly, language models should also be able to fill in that blank. You can think of a language
model as a “completion machine”: given a text (prompt), it can generate a response to complete
that text. Here’s an example:

Prompt (from user): I tried so hard, and got so far


Completion (from language model): But in the end, it doesn't even matter.

As simple as it sounds, completion turned out to be incredibly powerful, as many tasks can be
framed as completion tasks: translation, summarization, writing code, doing math, etc. For
example, given the prompt: How are you in French is ..., a language model might be able
to complete it with: Comment ça va , effectively translating from one language to another.

To train a language model for completion, you feed it a lot of text so that it can distill statistical
information from it. The text given to the model to learn from is called training data. Consider a
language that contains only two tokens 0 and 1. If you feed a language model the following
sequences as training data:

0101
010101
01010101
0011
00110011
001100110011

the language model might distill that:

If the context is 01, the next tokens are likely 01
If the context is 0011, the next tokens are likely 0011
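To make "distill statistical information" concrete, here's a toy sketch in Python (my own illustration, not anything used to train real LLMs) that counts which continuations follow each context in the training sequences above:

```python
from collections import Counter, defaultdict

# Toy "language model": count which continuation follows each context in the training data.
# Real LLMs learn these statistics with neural networks over huge vocabularies, but the
# intuition is the same.
training_sequences = ["0101", "010101", "01010101", "0011", "00110011", "001100110011"]

continuations = defaultdict(Counter)
for seq in training_sequences:
    for start in range(len(seq)):
        for ctx_len in (2, 4):                      # contexts like "01" and "0011"
            end = start + ctx_len
            if end + ctx_len <= len(seq):           # need room for a same-length continuation
                continuations[seq[start:end]][seq[end:end + ctx_len]] += 1

print(continuations["01"].most_common())    # [('01', 6), ('10', 3)] -> "01" is the likely continuation
print(continuations["0011"].most_common())  # [('0011', 3)] -> "0011" is the likely continuation
```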

Since language models mimic their training data, they are only as good as their
training data, hence the phrase “Garbage in, garbage out”. If you train a language model on
Reddit comments, you might not want to take it home to show to your parents.

Mathematical formulation
ML task: language modeling
Training data: low-quality data
Data scale: usually in the order of trillions of tokens as of May 2023.
GPT-3’s dataset (OpenAI): 0.5 trillion tokens. I can’t find any public info for GPT-4, but
I’d estimate it to use an order of magnitude more data than GPT-3.
Gopher’s dataset (DeepMind): 1 trillion tokens
RedPajama (Together): 1.2 trillion tokens
LLaMa’s dataset (Meta): 1.4 trillion tokens
Model resulting from this process: LLM

$LLM_\phi$: the language model being trained, parameterized by $\phi$. The goal is to find $\phi$ for which the cross-entropy loss is minimized.
$[T_1, T_2, \dots, T_V]$: vocabulary – the set of all unique tokens in the training data.
$V$: the vocabulary size.
$f(x)$: function mapping a token to its position in the vocab. If $x$ is $T_k$ in the vocab, $f(x) = k$.
Given the sequence $(x_1, x_2, \dots, x_n)$, we'll have $n$ training samples:
Input: $x = (x_1, x_2, \dots, x_{i-1})$
Ground truth: $x_i$
For each training sample $(x, x_i)$:
Let $k = f(x_i)$
Model's output: $LLM(x) = [\bar{y}_1, \bar{y}_2, \dots, \bar{y}_V]$. Note: $\sum_j \bar{y}_j = 1$
The loss value: $CE(x, x_i; \phi) = -\log \bar{y}_k$
Goal: find $\phi$ to minimize the expected loss on all training samples: $CE(\phi) = -E_x \log \bar{y}_k$
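To make the loss concrete, here's a minimal sketch assuming PyTorch. The tiny embedding-plus-linear "model" is a stand-in of my own, not any of the LLMs above; a real LLM is a deep transformer, but the cross-entropy loss is computed the same way:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in "LLM": an embedding followed by a linear head over the vocabulary.
vocab_size, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# A fake batch of token ids: 8 sequences of 33 tokens each.
tokens = torch.randint(0, vocab_size, (8, 33))
inputs, targets = tokens[:, :-1], tokens[:, 1:]          # shift by one: position i's target is token i+1

logits = model(inputs)                                   # [batch, seq, vocab]: LLM(x) before softmax
loss = F.cross_entropy(logits.reshape(-1, vocab_size),   # CE = -log ybar_k, averaged over positions
                       targets.reshape(-1))
loss.backward()
optimizer.step()
```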

Data bottleneck for pretraining


Today, a language model like GPT-4 uses so much data that there’s a realistic concern that we’ll
run out of Internet data in the next few years. It sounds crazy, but it’s happening. To get a sense
of how big a trillion tokens is: a book contains around 50,000 words or 67,000 tokens. 1 trillion
tokens are equivalent to 15 million books.

Side-by-side comparison of RedPajama and LLaMa data, done by RedPajama.

The rate of training dataset size growth is much faster than the rate of new data being
generated (Villalobos et al, 2022). If you’ve ever put anything on the Internet, you should
assume that it is already or will be included in the training data for some language models,
whether you consent or not. This is similar to how, if you post something on the Internet, you
should expect it to be indexed by Google.

On top of that, the Internet is being rapidly populated with data generated by large language
models like ChatGPT. If companies continue using Internet data to train large LLMs, these new
LLMs might just be trained on data generated by existing LLMs.

Once the publicly available data is exhausted, the most feasible path for more training data is
with proprietary data. I suspect that any company that somehow gets its hand on a massive
amount of proprietary data – copyrighted books, translations, video/podcast transcriptions,
contracts, medical records, genome sequences, user data, etc. – will have a competitive
advantage. It’s not surprising that in light of ChatGPT, many companies have changed their data
terms to prevent other companies from scraping their data for LLMs – see Reddit,
StackOverflow.

Phase 2. Supervised finetuning (SFT) for dialogue

Why SFT

Pretraining optimizes for completion. If you give the pretrained model a question, say, How to
make pizza, any of the following could be a valid completion:

1. Adding more context to the question: for a family of six


2. Adding follow-up questions: ? What ingredients do I need? How much time would
it take?
3. Actually giving the answer

The third option is preferred if you’re looking for an answer. The goal of SFT is to optimize the
pretrained model to generate the responses that users are looking for.

How to do that? We know that a model mimics its training data. During SFT, we show our
language model examples of how to appropriately respond to prompts of different use cases
(e.g. question answering, summarization, translation). The examples follow the format (prompt,
response) and are called demonstration data. OpenAI calls supervised finetuning behavior
cloning: you demonstrate how the model should behave, and the model clones this behavior.

The distribution of prompts used to finetune InstructGPT

To train a model to mimic the demonstration data, you can either start with the pretrained model and finetune it, or train from scratch. The finetuned approach produces much superior results: OpenAI showed that the outputs from the finetuned 1.3B-parameter InstructGPT model are preferred to outputs from the 175B GPT-3.

Demonstration data
Demonstration data can be generated by humans, like what OpenAI did with InstructGPT and
ChatGPT. Unlike traditional data labeling, demonstration data is generated by highly educated
labelers who pass a screen test. Among those who labeled demonstration data for InstructGPT,
~90% have at least a college degree and more than one-third have a master’s degree.

OpenAI’s 40 labelers created around 13,000 (prompt, response) pairs for InstructGPT. Here are
a few examples:

Prompt: Serendipity means the occurrence and development of events by chance in a happy or beneficial way. Use the word in a sentence.
Response: Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity.

Prompt: ELI5: What's the cause of the "anxiety lump" in our chest during stressful or disheartening experiences?
Response: The anxiety lump in your throat is caused by muscular tension keeping your glottis dilated to maximize airflow. The clenched chest or heartache feeling is caused by the vagus nerve which tells the organs to pump blood faster, stop digesting, and produce adrenaline and cortisol.

Prompt: Create a shopping list from this recipe: Trim the ends off zucchini. Cut zucchini in half lengthwise; scoop out pulp, leaving 1/2-in. shells. Finely chop pulp. In a skillet, cook beef, zucchini pulp, onion, mushrooms and peppers over medium heat until meat is no longer pink; drain. Remove from the heat. Add 1/2 cup cheese, ketchup, salt and pepper; mix well. Spoon into the zucchini shells. Place in a greased 13x9-in. baking dish. Sprinkle with remaining cheese.
Response: Zucchini, beef, onion, mushroom, peppers, cheese, ketchup, salt, pepper


OpenAI’s approach yields high-quality demonstration data but is expensive and time-
consuming. Instead, DeepMind used heuristics to filter for dialogues from Internet data for
their model Gopher (Rae et al., 2021).

»» Side note: DeepMind’s heuristics for dialogues ««

Concretely, we find all sets of consecutive paragraphs (blocks of text separated by two
newlines) at least 6 paragraphs long, with all paragraphs having a prefix ending in a separator
(e.g., Gopher: , Dr Smith - , or Q. ). The even-indexed paragraphs must have the same
prefix as each other, and the same for the odd-indexed paragraphs, but both prefixes should
be different (in other words, the conversation must be strictly back-and-forth between two
individuals). This procedure reliably yields high-quality dialogue.

»» Side note: on finetuning for dialogues vs. finetuning for following instructions ««

OpenAI’s InstructGPT is finetuned for following instructions. Each example of demonstration data is a pair of (prompt, response). DeepMind’s Gopher is finetuned for conducting dialogues.
Each example of demonstration is multiple turns of back-and-forth dialogues. Instructions
are subsets of dialogues – ChatGPT is a powered-up version of InstructGPT.

Mathematical formulation
The mathematical formulation is very similar to the one in phase 1.

ML task: language modeling


Training data: high-quality data in the format of (prompt, response)
Data scale: 10,000 - 100,000 (prompt, response) pairs
InstructGPT: ~14,500 pairs (13,000 from labelers + 1,500 from customers)
Alpaca: 52K ChatGPT instructions
Databricks’ Dolly-15k: ~15k pairs, created by Databricks employees
OpenAssistant: 161,000 messages in 10,000 conversations -> approximately 88,000
pairs
Dialogue-finetuned Gopher: ~5 billion tokens, which I estimate to be on the order of 10M messages. However, keep in mind that these were filtered from the Internet using heuristics, so they're not of the highest quality.
Model input and output
Input: prompt
Output: response for this prompt
Loss function to minimize during the training process: cross entropy, but only the tokens
in the response are counted towards the loss.
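As a minimal sketch of how this masking is commonly implemented (assuming PyTorch; the tensors here are random stand-ins, not OpenAI's or DeepMind's actual training code), you can set the labels at prompt positions to an ignore value so only response tokens contribute to the cross entropy:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100                                   # F.cross_entropy skips targets with this value

vocab_size = 1000
prompt_ids = torch.randint(0, vocab_size, (12,))      # tokenized prompt (stand-in)
response_ids = torch.randint(0, vocab_size, (20,))    # tokenized response (stand-in)

input_ids = torch.cat([prompt_ids, response_ids])     # the model sees prompt + response
labels = input_ids.clone()
labels[: len(prompt_ids)] = IGNORE_INDEX              # mask out the prompt tokens

logits = torch.randn(len(input_ids), vocab_size)      # stand-in for the model's per-token logits
loss = F.cross_entropy(logits[:-1], labels[1:],       # shift by one: position i predicts token i+1
                       ignore_index=IGNORE_INDEX)
```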


Phase 3. RLHF
Empirically, RLHF improves performance significantly compared to SFT alone. However, I haven't seen an argument for why it works that I find foolproof. Anthropic explained that: “we expect human
feedback (HF) to have the largest comparative advantage over other techniques when people
have complex intuitions that are easy to elicit but difficult to formalize and automate.” (Bai et
al., 2022)

InstructGPT (SFT + RLHF) outperforms SFT alone

Dialogues are flexible. Given a prompt, there are many plausible responses, some better
than others. Demonstration data tells the model what responses are plausible for a given
context, but doesn’t tell the model how good or how bad a response is.

The idea: what if we have a scoring function that, if given a prompt and a response, outputs a
score for how good that response is? Then we use this scoring function to further train our
LLMs towards giving responses with high scores. That’s exactly what RLHF does. RLHF consists
of two parts:

1. Train a reward model to act as a scoring function.


2. Optimize LLM to generate responses for which the reward model will give high scores.

»»Side note: Hypotheses on why RLHF works««



Yoav Goldberg has an excellent note on the three hypotheses on why RLHF works.

The diversity hypothesis: during SFT, the model’s output is expected to somewhat match
the demonstrated responses. For example, given the prompt “what’s an example of a
language?”, if the demonstrated response is “Spanish” and the model’s response is “Java”,
the model’s response might be marked as wrong.
The negative feedback hypothesis: demonstration only gives the model positive signals
(e.g. only showing the model good responses), not negative signals (e.g. showing models
what bad responses look like). RL allows us to show models negative signals.
The hallucination hypothesis: RLHF is supposed to help with hallucination, which we’ll go
into in the RLHF and hallucination section.

3.1. Reward model (RM)


The RM’s job is to output a score for a pair of (prompt, response). Training a model to output a
score on a given input is a pretty common task in ML. You can simply frame it as a classification
or a regression task. The challenge with training a reward model is with obtaining trustworthy
data. Getting different labelers to give consistent scores for the same response turns out to be
quite difficult. It’s a lot easier to ask labelers to compare two responses and decide which one is
better.

The labeling process would produce data that looks like this: (prompt, winning_response,
losing_response). This is called comparison data.

Here’s an example of comparison data from Anthropic’s HH-RLHF dataset. Personally, I prefer
the losing_response. This highlights a challenge for learning from human preferences. Human
preferences are diverse and impossible to capture in a single mathematical formulation.

prompt: How can I get my dog high?

winning_response: I don't know that we should get the dog high. I think it's important for a dog to experience the world in a sober state of mind.

losing_response: I'm not sure what you mean by that.

Now comes the trippy part: given only this comparison data, how do you train the model to give
concrete scores? Just like how you can get humans to do (basically) anything given the right
incentive, you can get a model to do (basically) anything given the right objective (aka loss
function).


For InstructGPT, the objective is to maximize the difference in score between the winning
response and the losing response (see detail in the section Mathematical formulation).

People have experimented with different ways to initialize an RM: e.g. training an RM from
scratch or starting with the SFT model as the seed. Starting from the SFT model seems to give
the best performance. The intuition is that the RM should be at least as powerful as the LLM to
be able to score the LLM’s responses well.

Mathematical formulation

There might be some variations, but here’s the core idea.

Training data: high-quality data in the format of (prompt, winning_response, losing_response)
Data scale: 100K - 1M examples
InstructGPT: 50,000 prompts. Each prompt has 4 to 9 responses, forming between 6
and 36 pairs of (winning_response, losing_response). This means between 300K and
1.8M training examples in the format of (prompt, winning_response,
losing_response).
Constitutional AI, which is suspected to be the backbone of Claude (Anthropic): 318K
comparisons – 135K generated by humans, and 183K generated by AI. Anthropic has
an older version of their data open-sourced (hh-rlhf), which consists of roughly 170K
comparisons.

$r_\theta$: the reward model being trained, parameterized by $\theta$. The goal of the training process is to find $\theta$ for which the loss is minimized.
Training data format:
$x$: prompt
$y_w$: winning response
$y_l$: losing response
For each training sample $(x, y_w, y_l)$:
$s_w = r_\theta(x, y_w)$: reward model's score for the winning response
$s_l = r_\theta(x, y_l)$: reward model's score for the losing response
Loss value: $-\log(\sigma(s_w - s_l))$
Goal: find $\theta$ to minimize the expected loss for all training samples: $-E_x \log(\sigma(s_w - s_l))$

To get more intuition for how this loss function works, let's visualize it.

Let $d = s_w - s_l$. Here's the graph for $f(d) = -\log(\sigma(d))$. The loss value is large for negative $d$, which incentivizes the reward model to not give the winning response a lower score than the losing response.
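In code, the loss is a one-liner. Here's a minimal sketch assuming PyTorch and a hypothetical `reward_model` callable that maps (prompt, response) token ids to a scalar score; this is an illustration, not the InstructGPT implementation:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt_ids, winning_ids, losing_ids):
    s_w = reward_model(prompt_ids, winning_ids)   # score for the winning response
    s_l = reward_model(prompt_ids, losing_ids)    # score for the losing response
    # -log(sigmoid(s_w - s_l)): large when s_w < s_l, pushing the winning score above the losing one
    return -F.logsigmoid(s_w - s_l).mean()
```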


UI to collect comparison data

Below is a screenshot of the UI that OpenAI’s labelers used to create training data for
InstructGPT’s RM. Labelers both give concrete scores from 1 to 7 and rank the responses in the
order of preference, but only the ranking is used to train the RM. Their inter-labeler agreement
is around 73%, which means if they ask 10 people to rank 2 responses, 7 of them will have the
same ranking.


To speed up the labeling process, they ask each annotator to rank multiple responses. 4 ranked
responses, e.g. A > B > C > D, will produce 6 ranked pairs, e.g. (A > B), (A > C), (A > D), (B > C), (B >
D), (C > D).
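Turning one ranking into pairs is just enumerating combinations; a quick sketch:

```python
from itertools import combinations

ranking = ["A", "B", "C", "D"]              # one annotator's ranking, best to worst
pairs = list(combinations(ranking, 2))      # each pair keeps the better-ranked response first
print(pairs)
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')] -> 6 (winning, losing) pairs
```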

3.2. Finetuning using the reward model


In this phase, we will further train the SFT model to generate output responses that will
maximize the scores by the RM. Today, most people use Proximal Policy Optimization (PPO), a
reinforcement learning algorithm released by OpenAI in 2017.

During this process, prompts are randomly selected from a distribution – e.g. we might
randomly select among customer prompts. Each of these prompts is input into the LLM model
to get back a response, which is given a score by the RM.

OpenAI also found that it’s necessary to add a constraint: the model resulting from this phase
should not stray too far from the model resulting from the SFT phase (mathematically
represented as the KL divergence term in the objective function below) and the original
pretraining model. The intuition is that there are many possible responses for any given
prompt, the vast majority of which the RM has never seen before. For many of those unknown
(prompt, response) pairs, the RM might give an extremely high or low score by mistake.
Without this constraint, we might bias toward those responses with extremely high scores,
even though they might not be good responses.

OpenAI has this great diagram that explains the SFT and RLHF for InstructGPT.

Mathematical formulation

ML task: reinforcement learning


Action space: the vocabulary of tokens the LLM uses. Taking action means choosing a
token to generate.
Observation space: the distribution over all possible prompts.
Policy: the probability distribution over all actions to take (aka all tokens to generate)
given an observation (aka a prompt). An LLM constitutes a policy because it dictates
how likely a token is to be generated next.

Reward function: the reward model.


Training data: randomly selected prompts
Data scale: 10,000 - 100,000 prompts
InstructGPT: 40,000 prompts

$RM$: the reward model obtained from phase 3.1.
$LLM^{SFT}$: the supervised finetuned model obtained from phase 2.
Given a prompt $x$, it outputs a distribution of responses.
In the InstructGPT paper, $LLM^{SFT}$ is represented as $\pi^{SFT}$.
$LLM^{RL}_\phi$: the model being trained with reinforcement learning, parameterized by $\phi$.
The goal is to find $\phi$ to maximize the score according to $RM$.
Given a prompt $x$, it outputs a distribution of responses.
In the InstructGPT paper, $LLM^{RL}_\phi$ is represented as $\pi^{RL}_\phi$.
$x$: prompt
$D_{RL}$: the distribution of prompts used explicitly for the RL model.
$D_{pretrain}$: the distribution of the training data for the pretrained model.

For each training step, you sample a batch of $x_{RL}$ from $D_{RL}$ and a batch of $x_{pretrain}$ from $D_{pretrain}$. The objective function for each sample depends on which distribution the sample comes from.

1. For each $x_{RL}$, we use $LLM^{RL}_\phi$ to sample a response: $y \sim LLM^{RL}_\phi(x_{RL})$. The objective is computed as follows. Note that the second term in this objective is the KL divergence to make sure that the RL model doesn't stray too far from the SFT model.

$$\text{objective}_1(x_{RL}, y; \phi) = RM(x_{RL}, y) - \beta \log \frac{LLM^{RL}_\phi(y|x)}{LLM^{SFT}(y|x)}$$

2. For each $x_{pretrain}$, the objective is computed as follows. Intuitively, this objective is to make sure that the RL model doesn't perform worse on text completion – the task the pretrained model was optimized for.

$$\text{objective}_2(x_{pretrain}; \phi) = \gamma \log LLM^{RL}_\phi(x_{pretrain})$$

The final objective is the sum of the expectation of the two objectives above. In the RL setting, we maximize the objective instead of minimizing it as done in the previous steps.

$$\text{objective}(\phi) = E_{x \sim D_{RL}} E_{y \sim LLM^{RL}_\phi(x)} \left[ RM(x, y) - \beta \log \frac{LLM^{RL}_\phi(y|x)}{LLM^{SFT}(y|x)} \right] + \gamma E_{x \sim D_{pretrain}} \log LLM^{RL}_\phi(x)$$

Note:

The notation used is slightly different from the notation used in the InstructGPT paper, as I find
the notation here a bit more explicit, but they both refer to the exact same objective function.
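To make the objective concrete, here's a rough sketch of how the three terms fit together, assuming PyTorch. This only illustrates the objective above; it is not PPO itself (no clipping, value function, or advantage estimation), and all the inputs are hypothetical per-sample quantities you'd compute from the models:

```python
import torch

def rlhf_objective(reward: torch.Tensor,            # RM(x, y) for responses sampled from LLM_RL
                   rl_logprobs: torch.Tensor,       # log LLM_RL(y|x), summed over response tokens
                   sft_logprobs: torch.Tensor,      # log LLM_SFT(y|x) for the same responses
                   pretrain_logprobs: torch.Tensor, # log LLM_RL(x_pretrain) on pretraining text
                   beta: float = 0.02,              # illustrative coefficients, not the paper's values
                   gamma: float = 0.1) -> torch.Tensor:
    kl_term = rl_logprobs - sft_logprobs             # log ratio; penalizes drifting from the SFT model
    objective_1 = (reward - beta * kl_term).mean()   # reward minus the KL penalty
    objective_2 = gamma * pretrain_logprobs.mean()   # don't get worse at plain text completion
    return objective_1 + objective_2                 # maximize this (minimize its negative with SGD)
```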

The objective function as written in the InstructGPT paper.

RLHF and hallucination


Hallucination happens when an AI model makes stuff up. It’s a big reason why many companies
are hesitant to incorporate LLMs into their workflows.

There are two hypotheses that I found that explain why LLMs hallucinate.

The first hypothesis, first expressed by Pedro A. Ortega et al. at DeepMind in Oct 2021, is that
LLMs hallucinate because they “lack the understanding of the cause and effect of their actions”
(back then, DeepMind used the term “delusion” for “hallucination”). They showed that this can
be addressed by treating response generation as causal interventions.

The second hypothesis is that hallucination is caused by the mismatch between the LLM’s
internal knowledge and the labeler’s internal knowledge. In his UC Berkeley talk (April 2023),
John Schulman, OpenAI co-founder and PPO author, suggested that behavior cloning causes
hallucination. During SFT, LLMs are trained to mimic responses written by humans. If we give a
response using the knowledge that we have but the LLM doesn’t have, we’re teaching the LLM
to hallucinate.

This view was also well articulated by Leo Gao, another OpenAI employee, in Dec 2021. In
theory, the human labeler can include all the context they know with each prompt to teach the
model to use only the existing knowledge. However, this is impossible in practice.

Schulman believed that LLMs know if they know something (which is a big claim, IMO). If that's true, hallucination can be fixed if we find a way to force LLMs to only give answers that
contain information they know. He then proposed a couple of solutions.

1. Verification: asking the LLM to explain (retrieve) the sources where it gets the answer
from.
2. RL. Remember that the reward model in phase 3.1 is trained using only comparisons:
response A is better than response B, without any information on how much better or why
A is better. Schulman argued that we can solve hallucination by having a better reward
function, e.g. punishing a model more for making things up.


Here’s a screenshot from John Schulman’s talk in April 2023.

From Schulman’s talk, I got the impression that RLHF is supposed to help with hallucination.
However, the InstructGPT paper shows that RLHF actually made hallucination worse. Even
though RLHF caused worse hallucination, it improved other aspects, and overall, human labelers prefer the RLHF model over the SFT-only model.


Hallucination is worse for InstructGPT (RLHF + SFT) compared to just SFT (Ouyang et al., 2022)

Based on the assumption that LLMs know what they know, some people try to reduce
hallucination with prompts, e.g. adding Answer as truthfully as possible, and if
you're unsure of the answer, say "Sorry, I don't know" . Making LLMs respond
concisely also seems to help with hallucination – the fewer tokens LLMs have to generate, the
less chance they have to make things up.

Conclusion
This has been a really fun post to write – I hope you enjoyed reading it too. I had another whole
section about the limitations of RLHF – e.g. biases in human preference, the challenge of
evaluation, and data ownership issues – but decided to save it for another post because this one
has gotten long.

As I dove into papers about RLHF, I was impressed by three things:


1. Training a model like ChatGPT is a fairly complicated process – it’s amazing it worked at
all.
2. The scale is insane. I’ve always known that LLMs require a lot of data and compute, but the
entire Internet data!!??
3. How much companies (used to) share about their process. DeepMind’s Gopher paper is 120
pages. OpenAI’s InstructGPT paper is 68 pages, Anthropic shared their 161K hh-rlhf
comparison examples, Meta made available their LLaMa model for research. There’s also
an incredible amount of goodwill and drive from the community to create open-sourced
models and datasets, such as OpenAssistant and LAION. It’s an exciting time!

We’re still in the early days of LLMs. The rest of the world has just woken up to the potential of
LLMs, so the race has just begun. Many things about LLMs, including RLHF, will evolve. But I
hope that this post helped you understand better how LLMs are trained under the hood, which
can hopefully help you with choosing the best LLM for your needs!

