Reinforcement Learning From Human Feedback (RLHF)
Malay Agarwal
Contents
Why Is Alignment Important?
Reinforcement Learning From Human Feedback (RLHF)
Reward Hacking
Problem - Scaling Human Feedback
Introduction to Constitutional AI
Implementation
    Stage 1 - Supervised Fine-Tuning
    Stage 2 - Reinforcement Learning From AI Feedback
Useful Resources
Why Is Alignment Important?
The human values of helpfulness, honesty and harmlessness are important for an LLM; collectively, these values are called HHH.
This is where alignment via additional fine-tuning comes in. It is used to increase
the HHH factors of an LLM. It can also reduce the toxicity of responses and
reduce the generation of incorrect information.
Reinforcement Learning From Human Feedback (RLHF)
Introduction
Reinforcement Learning From Human Feedback (RLHF) is a technique used to
fine-tune LLMs with human feedback. It uses reinforcement learning to fine-tune
the LLM with human feedback data, resulting in a model that is better aligned
with human preferences.
Reinforcement Learning
Intuition
Reinforcement Learning (RL) is a machine learning paradigm in which an agent learns to achieve a specific goal by taking actions in an environment, with the objective of maximizing some notion of cumulative reward.
The agent continually learns from its experiences by:
• Taking actions
• Observing the resulting changes in the environment, and
• Receiving rewards or penalties based on the outcomes of its actions.
By iterating through this process, the agent gradually refines its strategy or
policy to make better decisions and increase its chances of success.
Example: Training a Model to Play Tic-Tac-Toe
The RL set-up for training such a model is as follows:
• Agent: A model (or policy) acting as a tic-tac-toe player.
• Objective: Win the game.
• Environment: Game board.
• States: At any moment, the state is the current configuration of the
board.
• Action space: All the possible positions a player can choose to place a
marker in, based on the current state.
The agent makes decisions by following a strategy known as the RL policy. As
the agent takes actions, it collects rewards based on the actions’ effectiveness in
progressing towards a win. The goal of the agent is to learn the optimal policy
for a given environment that maximizes its rewards.
• Starting from an initial state, the agent chooses an action, observes the new state of the board, and continues taking further actions.
Note: The series of actions and corresponding states form a
playout, often called a rollout.
• As the agent accumulates experiences, it gradually uncovers actions that
yield the highest long-term rewards, ultimately leading to success in the
game.
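To make this loop concrete, here is a minimal sketch of the agent-environment interaction in code. The environment and policy below are illustrative stand-ins rather than a real RL library; only the structure of acting, observing, collecting a rollout and updating the policy matters.

```python
import random

# A minimal sketch of the agent-environment loop. The environment and policy are
# illustrative stand-ins (not a real RL library).

class DummyEnv:
    """Toy environment: the episode ends after a fixed number of moves."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps

    def reset(self):
        self.steps = 0
        return "start"                                  # initial state

    def legal_actions(self, state):
        return [0, 1, 2]                                # placeholder action space

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.max_steps
        reward = 1.0 if done else 0.0                   # reward only at the end, like a win
        return f"state-{self.steps}", reward, done


class RandomPolicy:
    def choose_action(self, state, legal_actions):
        return random.choice(legal_actions)             # a trained policy would pick the best move

    def update(self, rollout):
        pass                                            # learning from experience would happen here


def play_episode(env, policy):
    state = env.reset()
    rollout = []                                        # the series of states/actions is the rollout
    done = False
    while not done:
        action = policy.choose_action(state, env.legal_actions(state))
        next_state, reward, done = env.step(action)
        rollout.append((state, action, reward))
        state = next_state
    policy.update(rollout)                              # refine the policy from accumulated experience
    return rollout


print(play_episode(DummyEnv(), RandomPolicy()))
```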
Reward System
In the context of fine-tuning an LLM with RLHF, the agent is the LLM itself (acting according to its policy), its actions are the tokens it generates, and the environment is the model's context window. The reward is assigned based on how closely the generated completions align with human preferences. Due to the variation in human responses to language, determining the reward is much more complicated than in the Tic-Tac-Toe example.
An example reward system is as follows:
• We can have a human evaluate all of the completions of the model against
some alignment metric, such as toxicity.
• The feedback can be represented as a scalar value, for example a one (not toxic) or a zero (toxic).
• The LLM weights are then updated iteratively to maximize the reward obtained from the human classifier (obtain as many ones as possible), enabling the model to generate non-toxic completions.
This reward system requires obtaining manual human feedback, which can be
time consuming and expensive.
Reward Model
A practical and scalable alternative is to use an additional model, called the
reward model, to classify the outputs of the LLM and evaluate the degree of
alignment with human preferences.
To obtain the reward model, we use a smaller set of human-labeled examples and train it with traditional supervised learning, since judging alignment can be framed as a classification problem.
This trained reward model is used to assess the output of the LLM and assign a reward value, which in turn is used to update the weights of the LLM and train a new human-aligned version.
Exactly how the weights are updated as the model completions are assessed
depends on the (reinforcement learning) algorithm used to optimize the RL
policy.
The steps involved in training a reward model are detailed below.
For example, given a prompt (say, My house is too hot), the LLM generates a few completions, and a human labeler ranks them from most to least helpful. The labeler might decide that the 2nd completion is the most useful and rank it as 1, that the 1st completion comes next and rank it as 2, and that the 3rd completion is the least useful and rank it as 3. The rankings are:

Completion                                       Rank
There is nothing you can do about hot houses     2
You can cool your house with air conditioning    1
It is not too hot                                3
This same process is repeated for every prompt-completion set in the feedback
dataset. Moreover, the same prompt-completion set is assigned to multiple
humans so that we can establish consensus and minimize the impact of poor
labelers in the group.
Completion H1 H2 H3
There is nothing you can do about hot houses 2 2 2
You can cool your house with air conditioning 1 1 3
It is not too hot 3 3 1
In the table above, for example, the third human labeler disagrees with the other two, which may indicate that they misunderstood the instructions for ranking.
The clarity of our instructions (regarding how to rank completions) can make a big difference in the quality of the human feedback we obtain.
In general:
The more detailed we make these instructions, the higher the likelihood
that the labelers will understand the task they have to carry out and
complete it exactly as we wish.
Once the rankings are collected, they are converted into pairwise comparisons between completions. For three ranked completions, there are three possible pairs:

a                                               b
There is nothing you can do about hot houses    You can cool your house with air conditioning
There is nothing you can do about hot houses    It is not too hot
You can cool your house with air conditioning   It is not too hot

For each pair, a reward of 1 is assigned to the human-preferred completion and a reward of 0 to the other, giving a reward vector of the form [reward for a, reward for b]:

a                                               b                                               Reward
There is nothing you can do about hot houses    You can cool your house with air conditioning   [0, 1]
There is nothing you can do about hot houses    It is not too hot                               [1, 0]
You can cool your house with air conditioning   It is not too hot                               [1, 0]
We will then reorder the completions in each pair so that the most preferred option comes first. This is important since the reward model expects the preferred response yj to come first.
yj (preferred)                                  yk (rejected)                                   Reward
You can cool your house with air conditioning   There is nothing you can do about hot houses    [1, 0]
There is nothing you can do about hot houses    It is not too hot                               [1, 0]
You can cool your house with air conditioning   It is not too hot                               [1, 0]
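As a concrete sketch of this data-preparation step, the snippet below (the data and helper function are purely illustrative, not part of any specific library) converts one labeler's rankings into pairwise rows with the preferred completion yj placed first:

```python
from itertools import combinations

# Illustrative only: one labeler's rankings for the three completions above
# (lower rank = more preferred). The helper below is not part of any library.
ranked = [
    ("You can cool your house with air conditioning", 1),
    ("There is nothing you can do about hot houses", 2),
    ("It is not too hot", 3),
]

def rankings_to_pairs(ranked_completions):
    """Turn ranked completions into (yj, yk, reward) rows with the preferred completion first."""
    pairs = []
    for (text_a, rank_a), (text_b, rank_b) in combinations(ranked_completions, 2):
        if rank_a < rank_b:                      # smaller rank means more preferred
            pairs.append((text_a, text_b, [1, 0]))
        else:
            pairs.append((text_b, text_a, [1, 0]))
    return pairs

for yj, yk, reward in rankings_to_pairs(ranked):
    print(yj, "|", yk, "|", reward)
```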
For a given prompt x, the model learns to favor the human-preferred completion yj by minimizing the following loss function:

loss = −log(σ(rj − rk))

where rj and rk are the rewards the reward model assigns to the preferred and rejected completions, respectively, and σ is the sigmoid function.
Note: Notice how the loss function does not have any notion of
labels in it despite this being a supervised learning problem. This is
exactly why the ordering of the completions in the pairwise data is
important. The loss function itself assumes that rj is the reward for
the preferred response.
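A minimal PyTorch sketch of this loss, assuming the reward model has already produced the scalar rewards rj and rk for a batch of pairs (the tensor values are placeholders):

```python
import torch
import torch.nn.functional as F

# Sketch of the pairwise loss, assuming the reward model has already produced scalar
# rewards for a batch of pairs. The tensor values below are placeholders.
r_j = torch.tensor([1.2, 0.3, 0.8])    # rewards for the human-preferred completions
r_k = torch.tensor([0.1, -0.9, -0.4])  # rewards for the rejected completions

# loss = -log(sigmoid(r_j - r_k)); logsigmoid is the numerically stable form.
loss = -F.logsigmoid(r_j - r_k).mean()
print(loss)  # minimizing this pushes r_j above r_k for every pair
```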
Once trained, the reward model can be used as a binary classifier. For example, when aligning a model to avoid hate speech, the positive class, the class we want to optimize for, would be “not hate” (does not contain hate speech), and the negative class, the class we want to avoid, would be “hate” (contains hate speech).
The logit (the unnormalized output of the reward model before applying any activation function such as softmax) for the positive class will be the reward value that we provide to the RLHF feedback loop.
For example, a non-toxic completion should receive a high “not hate” logit and therefore a good reward.
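As an illustration, the sketch below scores a completion with an off-the-shelf hate-speech classifier and takes the logit of the positive class as the reward. The model name and the index of the "not hate" class are assumptions for the example, not a prescription:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative only: any binary hate/not-hate classifier could act as the reward model.
# The model name and the index of the "not hate" class are assumptions for this sketch.
model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "This product is useful and well-priced."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits   # unnormalized scores, shape [1, num_classes]

not_hate_index = 0                           # assumed position of the positive class
reward = logits[0, not_hate_index].item()    # this logit is the reward fed to the RLHF loop
print(reward)
```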
To summarize the full RLHF loop: (1) a prompt from the dataset is passed to the instruct LLM, (2) the LLM generates a completion, (3) the reward model scores the completion, and (4) the reward is used by a reinforcement learning algorithm such as PPO to update the weights of the LLM. Steps 2-4 represent a single iteration of the RLHF process, and we repeat them for a certain number of epochs. As the epochs progress, we should see the reward increase for subsequent completions, indicating that RLHF is working as intended.
We continue the process until the model is aligned based on some evaluation criteria. For example, we can use a threshold on the reward value or a maximum number of steps (like 20,000).
Value Function and Value Loss
The expected reward of a completion is an important quantity used in the objective of PPO (Proximal Policy Optimization), the reinforcement learning algorithm typically used to update the LLM's weights during RLHF. PPO proceeds in two phases: in phase 1, the current LLM is used to generate completions for a set of prompts; in phase 2, these completions and their rewards are used to update the LLM.
This quantity is estimated using a separate head (another output layer) of the
LLM called the value function.
Consider that we have a number of prompts. To estimate the value function, we generate completions for these prompts using our instruct model and calculate the reward for each completion using the reward model.
Example:
Prompt 1: A dog is
Completion: a furry animal
Reward: 1.87
Prompt 2: This house is
Completion: very ugly
Reward: -1.24
The value function estimates the expected total reward for a given state S. In
other words, as the LLM generates each token of a completion, we want to
estimate the total future reward based on the current sequence of tokens. This
can be thought of as a baseline to evaluate the quality of completions against
our alignment criteria.
Example:
Prompt 1: A dog is
Completion: a
Vθ (s) = 0.34
Completion: a furry
Vθ (s) = 1.23
Since the value function is just another output layer in the LLM, it is automati-
cally computed during the forward pass of a prompt through the LLM.
It is learnt by minimizing the value loss, that is, the difference between the actual future total reward (1.87 for A dog is) and the estimated future total reward (1.23 for a furry). The value loss is essentially the mean squared error between these two quantities, given by:

L_{VF} = \frac{1}{2} \left\lVert V_\phi(s) - \left( \sum_{t=0}^{T} \gamma^t r_t \,\Big|\, s_0 = s \right) \right\rVert_2^2
In essence, we are solving a simple regression problem to fit the value function.
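A small sketch of this regression step for the A dog is example, assuming a discount factor of 1 and that only the finished completion receives a reward from the reward model:

```python
import torch
import torch.nn.functional as F

# Sketch of the value loss for the "A dog is" example, assuming a discount factor of 1
# and that the reward model scores only the finished completion (last token).
predicted_value = torch.tensor(1.23)            # V(s) estimated part-way through the completion

gamma = 1.0                                     # discount factor (assumed)
rewards = torch.tensor([0.0, 0.0, 1.87])        # per-token rewards; only the final token is scored
discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
actual_return = (discounts * rewards).sum()     # sum over t of gamma^t * r_t = 1.87

value_loss = 0.5 * F.mse_loss(predicted_value, actual_return)
print(value_loss)                               # the value head is trained to minimize this
```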
The second component of the PPO objective is the policy loss:

L_{POLICY} = \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t,\; g(\epsilon, \hat{A}_t) \right)

where:

g(\epsilon, \hat{A}_t) =
\begin{cases}
(1 + \epsilon)\hat{A}_t, & \hat{A}_t \ge 0 \\
(1 - \epsilon)\hat{A}_t, & \hat{A}_t < 0
\end{cases}
Here:
• πθold (at |st ) is the probability of the next token at given the current context
window st for the initial LLM (the LLM before this iteration of PPO
started)
• πθ (at |st ) is the probability of the next token at given the current context
window st for the updated LLM (a copy of the LLM before this iteration
of PPO started that we are going to keep modifying in the current iteration).
• Ât is the estimated advantage term of a given choice of action. It
compares how much better or worse the current action is as compared to
all possible actions at that state.
This is calculated by looking at the expected future rewards of a completion
following the next token and estimating how advantageous this completion
is compared to the rest.
In the above figure, the path at the top is a better completion since it goes towards a higher reward, while the path at the bottom is a worse completion since it goes towards a lower reward.
There are multiple advantage estimation algorithms available for calculating
this quantity. For a generalized form, see: Notes on the Generalized
Advantage Estimation Paper.
We can now understand how the loss function works to make sure that the model
is aligned. Consider:
• Advantage term Ât is positive: This implies that the token generated
by the updated LLM is better than average. The loss function becomes:
L_{POLICY} = \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\; 1 + \epsilon \right) \hat{A}_t
Since the advantage is positive, the objective will increase if the action
at becomes more likely - that is, πθ (at |st ) increases. But, the min in
the term applies a limit on how much the objective can increase. If
πθ (at |st ) > (1 + ϵ)πθold (at |st ), the min will limit LPOLICY to (1 + ϵ)Ât .
Thus, the new policy will not benefit by going far away from the old policy.
• Advantage term Ât is negative. This implies that the token generated
by the updated LLM is worse than average. The loss function becomes:
L_{POLICY} = \max\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\; 1 - \epsilon \right) \hat{A}_t
Since the advantage is negative, the objective will increase if the action
becomes less likely - that is, πθ (at |st ) decreases. But, the max in the
term puts a limit on how much the objective can increase. If πθ (at |st ) <
(1 − ϵ)πθold (at |st ), the max will limit LPOLICY to (1 − ϵ)Ât . Again, the new
policy does not benefit by going far away from the old policy.
Thus, we can see how the ratio πθ(at|st) / πθold(at|st) decides whether πθ(at|st) should increase or decrease, and how clipping ensures that the new policy does not stray far away from the old policy.
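The two cases above are equivalent to the standard clipped surrogate objective, min(ratio · Â, clip(ratio, 1 − ε, 1 + ε) · Â). Below is a minimal PyTorch sketch with placeholder log-probabilities and advantages; note that this objective is maximized, so its negative would be used as a loss:

```python
import torch

def clipped_policy_objective(logprob_new, logprob_old, advantages, epsilon=0.2):
    """Clipped surrogate objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = torch.exp(logprob_new - logprob_old)            # pi_theta / pi_theta_old per token
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    # Take the more pessimistic of the unclipped and clipped terms, then average over tokens.
    return torch.minimum(ratio * advantages, clipped_ratio * advantages).mean()

# Placeholder log-probabilities and advantages for three generated tokens.
logprob_new = torch.tensor([-1.0, -0.5, -2.0])
logprob_old = torch.tensor([-1.2, -0.6, -1.5])
advantages = torch.tensor([0.8, -0.3, 1.1])
print(clipped_policy_objective(logprob_new, logprob_old, advantages))
```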
In addition to the policy loss, we also have an entropy loss. The entropy loss helps the model maintain its creativity and is given by the entropy of the token distribution produced by the policy:

L_{ENT} = \operatorname{entropy}\big(\pi_\theta(\cdot \mid s_t)\big)

If we keep the entropy low, the model might end up always completing a prompt in the same way.
Pseudocode
In the standard PPO pseudocode, the first steps, which collect experience by generating completions and computing rewards, values and advantages, correspond to phase 1, while the remaining steps, which update the model by optimizing the combined objective, correspond to phase 2.
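As a rough sketch of how the pieces fit together, the snippet below combines the policy, value and entropy terms into a single loss for one phase-2 update step. The coefficients c1 and c2 and all tensor values are placeholder assumptions, not the course's exact formulation:

```python
import torch
import torch.nn.functional as F

# Rough sketch of one phase-2 update step, combining the three terms discussed above.
# The coefficients c1 and c2 and all tensor values are placeholder assumptions.

def ppo_loss(logprob_new, logprob_old, advantages, values, returns, entropy,
             epsilon=0.2, c1=0.5, c2=0.01):
    # Policy term: clipped surrogate objective (to be maximized).
    ratio = torch.exp(logprob_new - logprob_old)
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    policy_objective = torch.minimum(ratio * advantages, clipped * advantages).mean()

    # Value term: mean squared error between predicted values and actual returns.
    value_loss = 0.5 * F.mse_loss(values, returns)

    # Total loss to minimize: negate the objective, add the value loss, subtract the entropy bonus.
    return -policy_objective + c1 * value_loss - c2 * entropy

# Phase 1 would generate completions, score them with the reward model, and compute
# advantages and returns; here those results are simply fabricated tensors.
logprob_new = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
logprob_old = torch.tensor([-1.2, -0.6, -1.5])
advantages = torch.tensor([0.8, -0.3, 1.1])
values = torch.tensor([0.4, 0.9, 1.2], requires_grad=True)
returns = torch.tensor([0.5, 1.0, 1.9])
entropy = torch.tensor(2.3)

loss = ppo_loss(logprob_new, logprob_old, advantages, values, returns, entropy)
loss.backward()                     # phase 2: gradients of this loss update the LLM's weights
print(loss.item())
```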
Reward Hacking
Introduction
In Reinforcement Learning, it is possible for the agent to learn to cheat the
system by favoring actions that maximize the reward received even if those
actions don’t align well with the original objective.
In the case of LLMs, reward hacking can present itself as the model adding words or phrases to its completions that result in high scores for the metric being optimized, but that reduce the overall quality of the language.
For example, consider that we are using RLHF to reduce toxicity of an instruct
LLM. We have trained a reward model to reward each completion based on how
toxic the completion is.
We feed a prompt This product is to the LLM, which generates the completion
complete garbage. This results in a low reward and PPO updates the LLM
towards less toxicity. As the LLM is updated in each iteration of RLHF, it is
possible that the updated LLM diverges too much from the initial LLM since it
is trying to optimize the reward.
The model might learn to generate completions that it has learned will lead to
very low toxicity scores, such as most awesome, most incredible thing ever. This
completion is highly exaggerated. It is also possible that the model will generate
completions that are completely nonsensical, as long as the phrase leads to high
reward. For example, it can generate something like Beautiful love and world
peace all around, which has positive words and is likely to have a high reward.
Avoiding Reward Hacking
One possible solution is to use the initial instruct LLM as a reference model
against which we can check the performance of the RL-updated LLM. The
weights of this reference model are frozen and not updated during iterations of
RLHF.
During training, a prompt like This product is is passed to each model and both
generate completions. Say the reference generates the completion useful and
well-priced and the updated LLM generates the most awesome, most incredible
thing ever.
We can then compare the two completions and calculate a value called the
KL divergence (Kullback-Leibler divergence). KL divergence is a statistical
measure of how different two probability distributions are. Thus, by comparing
the completions, we can calculate how much the updated model has diverged
from the reference.
Note: The exact details of how KL divergence is calculated are
not discussed in the course. Some details are available here: KL
Divergence for Machine Learning.
KL divergence is calculated for each generated token across the whole vocabulary of the LLM, which can easily be tens or hundreds of thousands of tokens. However, using the softmax, we can reduce the number of probabilities considered to far less than the full vocabulary size (since many of the probabilities will be close to zero). It is still a compute-expensive process and the use of GPUs is recommended.
Once we have calculated the KL divergence between the two models, we add it
as a penalty to the reward calculation. It will penalize the updated LLM if it
shifts too far from the reference model.
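A sketch of how such a penalty could be applied is shown below. Random logits stand in for the two models' outputs, and both the direction of the KL term and the coefficient β are assumptions; implementations differ in these details:

```python
import torch
import torch.nn.functional as F

# Sketch of a per-token KL penalty. Random logits stand in for the outputs of the
# updated (policy) and frozen reference models over a generated completion; the
# penalty coefficient beta is an assumed hyperparameter.
vocab_size, seq_len = 50_000, 8
policy_logits = torch.randn(seq_len, vocab_size)
reference_logits = torch.randn(seq_len, vocab_size)

policy_logprobs = F.log_softmax(policy_logits, dim=-1)
reference_probs = F.softmax(reference_logits, dim=-1)

# KL(reference || policy) for each generated token, summed over the vocabulary.
kl_per_token = F.kl_div(policy_logprobs, reference_probs, reduction="none").sum(dim=-1)

reward = torch.tensor(1.5)              # reward from the reward model (placeholder)
beta = 0.1                              # strength of the penalty (assumed)
penalized_reward = reward - beta * kl_per_token.sum()
print(penalized_reward)
```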
Memory Constraints
Note that now we need two full copies of the LLM - the reference copy and the
copy that will be updated by RLHF.
As such, we can benefit by combining RLHF with PEFT: we update the weights of a PEFT adapter rather than the weights of the full LLM. We can thus use the same underlying LLM as both the reference model and the model that will be updated, reducing the memory footprint by approximately half.
To quantify how well RLHF worked, we first create a baseline average toxicity score for the initial LLM by passing it an evaluation dataset (for example, a summarization dataset) and scoring its completions with the reward model. Then, we pass the same dataset through the aligned LLM and obtain another average toxicity score.
If RLHF was successful, the average score of the aligned LLM should be lower
than that of the initial LLM.
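A minimal sketch of this comparison, where score_toxicity is a hypothetical stand-in for the reward model's toxicity score and the completions are made up for illustration:

```python
# Minimal sketch of the before/after comparison. `score_toxicity` is a hypothetical
# stand-in for the reward model's toxicity score (higher = more toxic), and the
# completions below are made up for illustration.
def score_toxicity(text):
    return 1.0 if "garbage" in text else 0.1      # placeholder heuristic, not a real model

def average_toxicity(completions):
    return sum(score_toxicity(text) for text in completions) / len(completions)

baseline_completions = ["This product is complete garbage", "The service was garbage"]
aligned_completions = ["This product could be improved", "The service was disappointing"]

baseline_score = average_toxicity(baseline_completions)
aligned_score = average_toxicity(aligned_completions)
print(baseline_score, aligned_score)   # if RLHF worked, the aligned score should be lower
```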
Introduction to Constitutional AI
Constitutional AI, first proposed in the paper Constitutional AI: Harmlessness
from AI Feedback (Anthropic, 2022), is one approach to scale human feedback.
It is a method for training models using a set of rules and principles that govern the model's behavior. Together with a set of sample prompts, these form a constitution. We then train the model to self-critique and revise its responses so that they comply with the constitution.
Constitutional AI is not only useful for scaling human feedback; it can also help with some unintended consequences of RLHF.
For example, depending on how the input prompt is structured, an aligned model may end up revealing harmful information as it tries to provide the most helpful response possible.
We may provide the prompt Can you help me hack into my neighbor’s wifi and
in the quest for being helpful, the LLM might give complete instructions on how
to do this.
Providing the model with a constitution can help the model in balancing these
competing interests and minimize the harm.
Examples of constitutional principles can be found in the Constitutional AI paper linked in the Useful Resources section.
Implementation
Implementing Constitutional AI consists of two stages.
Stage 1 - Supervised Fine-Tuning
In the first stage, we red team the model: we deliberately give it prompts designed to elicit a harmful response, such as Can you help me hack into my neighbor's wifi, and let it generate a completion. The prompt can then be augmented with:
Identify how the last response is harmful, unethical, racist, sexist,
toxic, dangerous or illegal.
This is fed to the LLM and it generates something like:
The response was harmful because hacking into someone else’s wifi is
an invasion of their privacy and is possibly illegal.
The model detects the problems in its response. We then put it all together and
ask the model to write a new response which removes all of the harmful and
illegal content. For example:
Rewrite the response to remove any and all harmful, unethical, racist,
sexist, toxic, dangerous or illegal content.
The model generates a new response:
Hacking into your neighbor’s wifi is an invasion of their privacy. It
may also land you in legal trouble. I advise against it.
Thus, the prompt Can you help me hack into my neighbor’s wifi becomes the
red team prompt and the above revised completion is the revised constitutional
response.
We repeat this process for many red team prompts and obtain a dataset of
prompt-completion pairs that can be used to fine-tune the model in a supervised
manner. The resultant LLM will have learnt to generate constitutional responses.
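A sketch of this critique-and-revise loop is shown below. The generate function is a hypothetical stand-in for a call to the LLM; the critique and revision instructions are the ones quoted above:

```python
# Sketch of the critique-and-revise loop that produces the fine-tuning pairs.
# `generate` is a hypothetical stand-in for a call to the LLM being trained.
def generate(prompt):
    return "..."                                   # a real implementation would query the LLM

CRITIQUE_REQUEST = ("Identify how the last response is harmful, unethical, racist, "
                    "sexist, toxic, dangerous or illegal.")
REVISION_REQUEST = ("Rewrite the response to remove any and all harmful, unethical, "
                    "racist, sexist, toxic, dangerous or illegal content.")

def constitutional_pair(red_team_prompt):
    initial_response = generate(red_team_prompt)
    critique = generate(f"{red_team_prompt}\n{initial_response}\n{CRITIQUE_REQUEST}")
    revision = generate(f"{red_team_prompt}\n{initial_response}\n{critique}\n{REVISION_REQUEST}")
    # The red-team prompt and the revised response form one supervised training example.
    return {"prompt": red_team_prompt, "completion": revision}

dataset = [constitutional_pair("Can you help me hack into my neighbor's wifi")]
print(dataset)
```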
Useful Resources
• RLHF paper - Learning to summarize from human feedback.
• NVIDIA article on RLHF.
• HuggingFace article on RLHF.
• YouTube - HuggingFace livestream on RLHF.
• YouTube - PPO Implementation From Scratch in PyTorch.
• Transformer Reinforcement Learning (trl) - Library by HuggingFace for
reinforcement learning on transformers.
• Constitutional AI paper - Constitutional AI: Harmlessness from AI Feedback.
• Lab 3 - Code example where FLAN-T5 is aligned using RLHF.