Reinforcement Learning From Human Feedback (RLHF)
Malay Agarwal
Contents
Why Is Alignment Important?
Reinforcement Learning From Human Feedback (RLHF)
Reward Hacking
Problem - Scaling Human Feedback
Introduction to Constitutional AI
Implementation
    Stage 1 - Supervised Fine-Tuning
    Stage 2 - Reinforcement Learning From AI Feedback
Useful Resources
Why Is Alignment Important?
The human values of helpfulness, honesty and harmlessness are important for an LLM; collectively, these values are called HHH.
This is where alignment via additional fine-tuning comes in. It is used to increase
the HHH factors of an LLM. It can also reduce the toxicity of responses and
reduce the generation of incorrect information.
Reinforcement Learning From Human Feedback (RLHF)
Introduction
Reinforcement Learning From Human Feedback (RLHF) is a technique used to
fine-tune LLMs with human feedback. It uses reinforcement learning to fine-tune
the LLM with human feedback data, resulting in a model that is better aligned
with human preferences.
Reinforcement Learning
Intuition
Reinforcement Learning (RL) is a machine learning paradigm in which an agent learns to achieve a specific goal by taking actions in an environment, with the objective of maximizing some notion of cumulative reward.
The agent continually learns from its experiences by:
• Taking actions
• Observing the resulting changes in the environment, and
• Receiving rewards or penalties based on the outcomes of its actions.
By iterating through this process, the agent gradually refines its strategy or
policy to make better decisions and increase its chances of success.
Example: Training a Model to Play Tic-Tac-Toe
The RL set-up for training such a model is as follows:
• Agent: A model (or policy) acting as a tic-tac-toe player.
• Objective: Win the game.
• Environment: Game board.
• States: At any moment, the state is the current configuration of the
board.
• Action space: All the possible positions a player can choose to place a
marker in, based on the current state.
The agent makes decisions by following a strategy known as the RL policy. As
the agent takes actions, it collects rewards based on the actions’ effectiveness in
progressing towards a win. The goal of the agent is to learn the optimal policy
for a given environment that maximizes its rewards.
• Starting from an initial state, the agent chooses an action, observes the new state of the board, and continues taking further actions.
Note: The series of actions and corresponding states form a
playout, often called a rollout.
• As the agent accumulates experiences, it gradually uncovers actions that
yield the highest long-term rewards, ultimately leading to success in the
game.
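To make this loop concrete, here is a minimal sketch of the agent-environment interaction in code. The environment and policy below are illustrative stand-ins rather than a real RL library; only the structure of acting, observing, collecting a rollout and updating the policy matters.

```python
import random

# A minimal sketch of the agent-environment loop. The environment and policy are
# illustrative stand-ins (not a real RL library).

class DummyEnv:
    """Toy environment: the episode ends after a fixed number of moves."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps

    def reset(self):
        self.steps = 0
        return "start"                                  # initial state

    def legal_actions(self, state):
        return [0, 1, 2]                                # placeholder action space

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.max_steps
        reward = 1.0 if done else 0.0                   # reward only at the end, like a win
        return f"state-{self.steps}", reward, done


class RandomPolicy:
    def choose_action(self, state, legal_actions):
        return random.choice(legal_actions)             # a trained policy would pick the best move

    def update(self, rollout):
        pass                                            # learning from experience would happen here


def play_episode(env, policy):
    state = env.reset()
    rollout = []                                        # the series of states/actions is the rollout
    done = False
    while not done:
        action = policy.choose_action(state, env.legal_actions(state))
        next_state, reward, done = env.step(action)
        rollout.append((state, action, reward))
        state = next_state
    policy.update(rollout)                              # refine the policy from accumulated experience
    return rollout


print(play_episode(DummyEnv(), RandomPolicy()))
```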
Reward System
In the context of fine-tuning an LLM with RLHF, the agent is the LLM itself (acting according to its policy), its actions are the tokens it generates, and the environment is the model's context window. The reward is assigned based on how closely the generated completions align with human preferences. Due to the variation in human responses to language, determining the reward is much more complicated than in the Tic-Tac-Toe example.
An example reward system is as follows:
• We can have a human evaluate all of the completions of the model against
some alignment metric, such as toxicity.
• The feedback can be represented as a scalar value, for example a one (not toxic) or a zero (toxic).
• The LLM weights are then updated iteratively to maximize the reward obtained from the human classifier (obtain as many ones as possible), enabling the model to generate non-toxic completions.
This reward system requires obtaining manual human feedback, which can be
time consuming and expensive.
Reward Model
A practical and scalable alternative is to use an additional model, called the
reward model, to classify the outputs of the LLM and evaluate the degree of
alignment with human preferences.
To obtain the reward model, we use a smaller set of human-labeled examples and train it with traditional supervised learning, since judging alignment can be framed as a classification problem.
This trained reward model is used to assess the output of the LLM and assign a reward value, which in turn is used to update the weights of the LLM and train a new human-aligned version.
Exactly how the weights are updated as the model completions are assessed
depends on the (reinforcement learning) algorithm used to optimize the RL
policy.
The steps involved in training a reward model are detailed below.
For example, given a prompt (say, My house is too hot), the LLM generates a few completions, and a human labeler ranks them from most to least helpful. The labeler might decide that the 2nd completion is the most useful and rank it as 1, that the 1st completion comes next and rank it as 2, and that the 3rd completion is the least useful and rank it as 3. The rankings are:

Completion                                       Rank
There is nothing you can do about hot houses     2
You can cool your house with air conditioning    1
It is not too hot                                3
This same process is repeated for every prompt-completion set in the feedback
dataset. Moreover, the same prompt-completion set is assigned to multiple
humans so that we can establish consensus and minimize the impact of poor
labelers in the group.
Completion H1 H2 H3
There is nothing you can do about hot houses 2 2 2
You can cool your house with air conditioning 1 1 3
It is not too hot 3 3 1
In the table above, for example, the third human labeler disagrees with the other two, which may indicate that they misunderstood the instructions for ranking.
The clarity of our instructions (regarding how to rank completions) can make a big difference in the quality of the human feedback we obtain.
In general:
The more detailed we make these instructions, the higher the likelihood
that the labelers will understand the task they have to carry out and
complete it exactly as we wish.
Once the rankings are collected, they are converted into pairwise comparisons between completions. For three ranked completions, there are three possible pairs:

a                                               b
There is nothing you can do about hot houses    You can cool your house with air conditioning
There is nothing you can do about hot houses    It is not too hot
You can cool your house with air conditioning   It is not too hot

For each pair, a reward of 1 is assigned to the human-preferred completion and a reward of 0 to the other, giving a reward vector of the form [reward for a, reward for b]:

a                                               b                                               Reward
There is nothing you can do about hot houses    You can cool your house with air conditioning   [0, 1]
There is nothing you can do about hot houses    It is not too hot                               [1, 0]
You can cool your house with air conditioning   It is not too hot                               [1, 0]
We will then reorder the completions in each pair so that the most preferred option comes first. This is important since the reward model expects the preferred response yj to come first.
yj (preferred)                                  yk (rejected)                                   Reward
You can cool your house with air conditioning   There is nothing you can do about hot houses    [1, 0]
There is nothing you can do about hot houses    It is not too hot                               [1, 0]
You can cool your house with air conditioning   It is not too hot                               [1, 0]
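As a concrete sketch of this data-preparation step, the snippet below (the data and helper function are purely illustrative, not part of any specific library) converts one labeler's rankings into pairwise rows with the preferred completion yj placed first:

```python
from itertools import combinations

# Illustrative only: one labeler's rankings for the three completions above
# (lower rank = more preferred). The helper below is not part of any library.
ranked = [
    ("You can cool your house with air conditioning", 1),
    ("There is nothing you can do about hot houses", 2),
    ("It is not too hot", 3),
]

def rankings_to_pairs(ranked_completions):
    """Turn ranked completions into (yj, yk, reward) rows with the preferred completion first."""
    pairs = []
    for (text_a, rank_a), (text_b, rank_b) in combinations(ranked_completions, 2):
        if rank_a < rank_b:                      # smaller rank means more preferred
            pairs.append((text_a, text_b, [1, 0]))
        else:
            pairs.append((text_b, text_a, [1, 0]))
    return pairs

for yj, yk, reward in rankings_to_pairs(ranked):
    print(yj, "|", yk, "|", reward)
```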
For a given prompt x, the model learns to favor the human-preferred completion yj by minimizing the following loss function:

loss = −log(σ(rj − rk))

where rj and rk are the rewards the reward model assigns to the preferred and rejected completions, respectively, and σ is the sigmoid function.
Note: Notice how the loss function does not have any notion of
labels in it despite this being a supervised learning problem. This is
exactly why the ordering of the completions in the pairwise data is
important. The loss function itself assumes that rj is the reward for
the preferred response.
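A minimal PyTorch sketch of this loss, assuming the reward model has already produced the scalar rewards rj and rk for a batch of pairs (the tensor values are placeholders):

```python
import torch
import torch.nn.functional as F

# Sketch of the pairwise loss, assuming the reward model has already produced scalar
# rewards for a batch of pairs. The tensor values below are placeholders.
r_j = torch.tensor([1.2, 0.3, 0.8])    # rewards for the human-preferred completions
r_k = torch.tensor([0.1, -0.9, -0.4])  # rewards for the rejected completions

# loss = -log(sigmoid(r_j - r_k)); logsigmoid is the numerically stable form.
loss = -F.logsigmoid(r_j - r_k).mean()
print(loss)  # minimizing this pushes r_j above r_k for every pair
```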
Once trained, the reward model can be used as a binary classifier. For example, when aligning a model to avoid hate speech, the positive class, the class we want to optimize for, would be “not hate” (does not contain hate speech), and the negative class, the class we want to avoid, would be “hate” (contains hate speech).
The logit (the unnormalized output of the reward model before applying any activation function such as softmax) for the positive class will be the reward value that we provide to the RLHF feedback loop.
For example, a non-toxic completion should receive a high “not hate” logit and therefore a good reward.
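As an illustration, the sketch below scores a completion with an off-the-shelf hate-speech classifier and takes the logit of the positive class as the reward. The model name and the index of the "not hate" class are assumptions for the example, not a prescription:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative only: any binary hate/not-hate classifier could act as the reward model.
# The model name and the index of the "not hate" class are assumptions for this sketch.
model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "This product is useful and well-priced."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits   # unnormalized scores, shape [1, num_classes]

not_hate_index = 0                           # assumed position of the positive class
reward = logits[0, not_hate_index].item()    # this logit is the reward fed to the RLHF loop
print(reward)
```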
To summarize the full RLHF loop: (1) a prompt from the dataset is passed to the instruct LLM, (2) the LLM generates a completion, (3) the reward model scores the completion, and (4) the reward is used by a reinforcement learning algorithm such as PPO to update the weights of the LLM. Steps 2-4 represent a single iteration of the RLHF process, and we repeat them for a certain number of epochs. As the epochs progress, we should see the reward increase for subsequent completions, indicating that RLHF is working as intended.
We continue the process until the model is aligned based on some evaluation criteria. For example, we can use a threshold on the reward value or a maximum number of steps (like 20,000).
Value Function and Value Loss
The expected reward of a completion is an important quantity used in the objective of PPO (Proximal Policy Optimization), the reinforcement learning algorithm typically used to update the LLM's weights during RLHF. PPO proceeds in two phases: in phase 1, the current LLM is used to generate completions for a set of prompts; in phase 2, these completions and their rewards are used to update the LLM.
This quantity is estimated using a separate head (another output layer) of the
LLM called the value function.
Consider that we have a number of prompts. To estimate the value function, we generate completions for these prompts using our instruct model and calculate the reward for each completion using the reward model.
Example:
Prompt 1: A dog is
Completion: a furry animal
Reward: 1.87
Prompt 2: This house is
Completion: very ugly
Reward: -1.24
The value function estimates the expected total reward for a given state S. In
other words, as the LLM generates each token of a completion, we want to
estimate the total future reward based on the current sequence of tokens. This
can be thought of as a baseline to evaluate the quality of completions against
our alignment criteria.
Example:
Prompt 1: A dog is
Completion: a
Vθ (s) = 0.34
Completion: a furry
Vθ (s) = 1.23
Since the value function is just another output layer in the LLM, it is automati-
cally computed during the forward pass of a prompt through the LLM.
It is learnt by minimizing the value loss, that is, the difference between the actual future total reward (1.87 for A dog is) and the estimated future total reward (1.23 for a furry). The value loss is essentially the mean squared error between these two quantities, given by:

L_{VF} = \frac{1}{2} \left\lVert V_\phi(s) - \left( \sum_{t=0}^{T} \gamma^t r_t \,\Big|\, s_0 = s \right) \right\rVert_2^2
In essence, we are solving a simple regression problem to fit the value function.
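A small sketch of this regression step for the A dog is example, assuming a discount factor of 1 and that only the finished completion receives a reward from the reward model:

```python
import torch
import torch.nn.functional as F

# Sketch of the value loss for the "A dog is" example, assuming a discount factor of 1
# and that the reward model scores only the finished completion (last token).
predicted_value = torch.tensor(1.23)            # V(s) estimated part-way through the completion

gamma = 1.0                                     # discount factor (assumed)
rewards = torch.tensor([0.0, 0.0, 1.87])        # per-token rewards; only the final token is scored
discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
actual_return = (discounts * rewards).sum()     # sum over t of gamma^t * r_t = 1.87

value_loss = 0.5 * F.mse_loss(predicted_value, actual_return)
print(value_loss)                               # the value head is trained to minimize this
```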
The second component of the PPO objective is the policy loss:

L_{POLICY} = \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t,\; g(\epsilon, \hat{A}_t) \right)

where:

g(\epsilon, \hat{A}_t) =
\begin{cases}
(1 + \epsilon)\hat{A}_t, & \hat{A}_t \ge 0 \\
(1 - \epsilon)\hat{A}_t, & \hat{A}_t < 0
\end{cases}
Here:
• πθold (at |st ) is the probability of the next token at given the current context
window st for the initial LLM (the LLM before this iteration of PPO
started)
• πθ (at |st ) is the probability of the next token at given the current context
window st for the updated LLM (a copy of the LLM before this iteration
of PPO started that we are going to keep modifying in the current iteration).
• Ât is the estimated advantage term of a given choice of action. It
compares how much better or worse the current action is as compared to
all possible actions at that state.
This is calculated by looking at the expected future rewards of a completion
following the next token and estimating how advantageous this completion
is compared to the rest.
In the above figure, the path at the top is a better completion since it goes towards a higher reward, while the path at the bottom is a worse completion since it goes towards a lower reward.
There are multiple advantage estimation algorithms available for calculating
this quantity. For a generalized form, see: Notes on the Generalized
Advantage Estimation Paper.
We can now understand how the loss function works to make sure that the model
is aligned. Consider:
• Advantage term Ât is positive: This implies that the token generated
by the updated LLM is better than average. The loss function becomes:
L_{POLICY} = \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\; 1 + \epsilon \right) \hat{A}_t
Since the advantage is positive, the objective will increase if the action
at becomes more likely - that is, πθ (at |st ) increases. But, the min in
the term applies a limit on how much the objective can increase. If
πθ (at |st ) > (1 + ϵ)πθold (at |st ), the min will limit LPOLICY to (1 + ϵ)Ât .
Thus, the new policy will not benefit by going far away from the old policy.
• Advantage term Ât is negative. This implies that the token generated
by the updated LLM is worse than average. The loss function becomes:
L_{POLICY} = \max\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},\; 1 - \epsilon \right) \hat{A}_t
Since the advantage is negative, the objective will increase if the action
becomes less likely - that is, πθ (at |st ) decreases. But, the max in the
term puts a limit on how much the objective can increase. If πθ (at |st ) <
(1 − ϵ)πθold (at |st ), the max will limit LPOLICY to (1 − ϵ)Ât . Again, the new
policy does not benefit by going far away from the old policy.
Thus, we can see how the ratio πθ(at|st) / πθold(at|st) decides whether πθ(at|st) should increase or decrease, and how clipping ensures that the new policy does not stray far away from the old policy.
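The two cases above are equivalent to the standard clipped surrogate objective, min(ratio · Â, clip(ratio, 1 − ε, 1 + ε) · Â). Below is a minimal PyTorch sketch with placeholder log-probabilities and advantages; note that this objective is maximized, so its negative would be used as a loss:

```python
import torch

def clipped_policy_objective(logprob_new, logprob_old, advantages, epsilon=0.2):
    """Clipped surrogate objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = torch.exp(logprob_new - logprob_old)            # pi_theta / pi_theta_old per token
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    # Take the more pessimistic of the unclipped and clipped terms, then average over tokens.
    return torch.minimum(ratio * advantages, clipped_ratio * advantages).mean()

# Placeholder log-probabilities and advantages for three generated tokens.
logprob_new = torch.tensor([-1.0, -0.5, -2.0])
logprob_old = torch.tensor([-1.2, -0.6, -1.5])
advantages = torch.tensor([0.8, -0.3, 1.1])
print(clipped_policy_objective(logprob_new, logprob_old, advantages))
```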
In addition to the policy loss, we also have an entropy loss. The entropy loss helps the model maintain its creativity and is given by the entropy of the token distribution produced by the policy:

L_{ENT} = \operatorname{entropy}\big(\pi_\theta(\cdot \mid s_t)\big)

If we keep the entropy low, the model might end up always completing a prompt in the same way.
Pseudocode
In the standard PPO pseudocode, the first steps, which collect experience by generating completions and computing rewards, values and advantages, correspond to phase 1, while the remaining steps, which update the model by optimizing the combined objective, correspond to phase 2.
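As a rough sketch of how the pieces fit together, the snippet below combines the policy, value and entropy terms into a single loss for one phase-2 update step. The coefficients c1 and c2 and all tensor values are placeholder assumptions, not the course's exact formulation:

```python
import torch
import torch.nn.functional as F

# Rough sketch of one phase-2 update step, combining the three terms discussed above.
# The coefficients c1 and c2 and all tensor values are placeholder assumptions.

def ppo_loss(logprob_new, logprob_old, advantages, values, returns, entropy,
             epsilon=0.2, c1=0.5, c2=0.01):
    # Policy term: clipped surrogate objective (to be maximized).
    ratio = torch.exp(logprob_new - logprob_old)
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    policy_objective = torch.minimum(ratio * advantages, clipped * advantages).mean()

    # Value term: mean squared error between predicted values and actual returns.
    value_loss = 0.5 * F.mse_loss(values, returns)

    # Total loss to minimize: negate the objective, add the value loss, subtract the entropy bonus.
    return -policy_objective + c1 * value_loss - c2 * entropy

# Phase 1 would generate completions, score them with the reward model, and compute
# advantages and returns; here those results are simply fabricated tensors.
logprob_new = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
logprob_old = torch.tensor([-1.2, -0.6, -1.5])
advantages = torch.tensor([0.8, -0.3, 1.1])
values = torch.tensor([0.4, 0.9, 1.2], requires_grad=True)
returns = torch.tensor([0.5, 1.0, 1.9])
entropy = torch.tensor(2.3)

loss = ppo_loss(logprob_new, logprob_old, advantages, values, returns, entropy)
loss.backward()                     # phase 2: gradients of this loss update the LLM's weights
print(loss.item())
```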
Reward Hacking
Introduction
In Reinforcement Learning, it is possible for the agent to learn to cheat the
system by favoring actions that maximize the reward received even if those
actions don’t align well with the original objective.
In the case of LLMs, reward hacking can present itself as the model adding words or phrases to its completions that result in high scores for the metric being optimized, but that reduce the overall quality of the language.
For example, consider that we are using RLHF to reduce toxicity of an instruct
LLM. We have trained a reward model to reward each completion based on how
toxic the completion is.
We feed a prompt This product is to the LLM, which generates the completion
complete garbage. This results in a low reward and PPO updates the LLM
towards less toxicity. As the LLM is updated in each iteration of RLHF, it is
possible that the updated LLM diverges too much from the initial LLM since it
is trying to optimize the reward.
The model might learn to generate completions that it has learned will lead to
very low toxicity scores, such as most awesome, most incredible thing ever. This
completion is highly exaggerated. It is also possible that the model will generate
completions that are completely nonsensical, as long as the phrase leads to high
reward. For example, it can generate something like Beautiful love and world
peace all around, which has positive words and is likely to have a high reward.
Avoiding Reward Hacking
One possible solution is to use the initial instruct LLM as a reference model
against which we can check the performance of the RL-updated LLM. The
weights of this reference model are frozen and not updated during iterations of
RLHF.
During training, a prompt like This product is is passed to each model and both
generate completions. Say the reference generates the completion useful and
well-priced and the updated LLM generates the most awesome, most incredible
thing ever.
We can then compare the two completions and calculate a value called the
KL divergence (Kullback-Leibler divergence). KL divergence is a statistical
measure of how different two probability distributions are. Thus, by comparing
the completions, we can calculate how much the updated model has diverged
from the reference.
Note: The exact details of how KL divergence is calculated are
not discussed in the course. Some details are available here: KL
Divergence for Machine Learning.
KL divergence is calculated for each generated token across the whole vocabulary of the LLM, which can easily be tens or hundreds of thousands of tokens. However, using the softmax, we can reduce the number of probabilities considered to far less than the full vocabulary size (since many of the probabilities will be close to zero). It is still a compute-expensive process and the use of GPUs is recommended.
Once we have calculated the KL divergence between the two models, we add it
as a penalty to the reward calculation. It will penalize the updated LLM if it
shifts too far from the reference model.
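A sketch of how such a penalty could be applied is shown below. Random logits stand in for the two models' outputs, and both the direction of the KL term and the coefficient β are assumptions; implementations differ in these details:

```python
import torch
import torch.nn.functional as F

# Sketch of a per-token KL penalty. Random logits stand in for the outputs of the
# updated (policy) and frozen reference models over a generated completion; the
# penalty coefficient beta is an assumed hyperparameter.
vocab_size, seq_len = 50_000, 8
policy_logits = torch.randn(seq_len, vocab_size)
reference_logits = torch.randn(seq_len, vocab_size)

policy_logprobs = F.log_softmax(policy_logits, dim=-1)
reference_probs = F.softmax(reference_logits, dim=-1)

# KL(reference || policy) for each generated token, summed over the vocabulary.
kl_per_token = F.kl_div(policy_logprobs, reference_probs, reduction="none").sum(dim=-1)

reward = torch.tensor(1.5)              # reward from the reward model (placeholder)
beta = 0.1                              # strength of the penalty (assumed)
penalized_reward = reward - beta * kl_per_token.sum()
print(penalized_reward)
```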
Memory Constraints
Note that now we need two full copies of the LLM - the reference copy and the
copy that will be updated by RLHF.
As such, we can benefit by combining RLHF with PEFT: we update the weights of a PEFT adapter rather than the weights of the full LLM. We can thus use the same underlying LLM as both the reference model and the model that will be updated, reducing the memory footprint by approximately half.
To quantify how well RLHF worked, we first create a baseline average toxicity score for the initial LLM by passing it an evaluation dataset (for example, a summarization dataset) and scoring its completions with the reward model. Then, we pass the same dataset through the aligned LLM and obtain another average toxicity score.
If RLHF was successful, the average score of the aligned LLM should be lower
than that of the initial LLM.
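A minimal sketch of this comparison, where score_toxicity is a hypothetical stand-in for the reward model's toxicity score and the completions are made up for illustration:

```python
# Minimal sketch of the before/after comparison. `score_toxicity` is a hypothetical
# stand-in for the reward model's toxicity score (higher = more toxic), and the
# completions below are made up for illustration.
def score_toxicity(text):
    return 1.0 if "garbage" in text else 0.1      # placeholder heuristic, not a real model

def average_toxicity(completions):
    return sum(score_toxicity(text) for text in completions) / len(completions)

baseline_completions = ["This product is complete garbage", "The service was garbage"]
aligned_completions = ["This product could be improved", "The service was disappointing"]

baseline_score = average_toxicity(baseline_completions)
aligned_score = average_toxicity(aligned_completions)
print(baseline_score, aligned_score)   # if RLHF worked, the aligned score should be lower
```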
Introduction to Constitutional AI
Constitutional AI, first proposed in the paper Constitutional AI: Harmlessness
from AI Feedback (Anthropic, 2022), is one approach to scale human feedback.
It is a method for training models using a set of rules and principles that govern the model's behavior. Together with a set of sample prompts, these form a constitution. We then train the model to self-critique and revise its responses so that they comply with the constitution.
Constitutional AI is not only useful for scaling human feedback; it can also help with some unintended consequences of RLHF.
For example, depending on how the input prompt is structured, an aligned model may end up revealing harmful information as it tries to provide the most helpful response possible.
We may provide the prompt Can you help me hack into my neighbor’s wifi and
in the quest for being helpful, the LLM might give complete instructions on how
to do this.
Providing the model with a constitution can help the model in balancing these
competing interests and minimize the harm.
Examples of constitutional principles can be found in the Constitutional AI paper linked in the Useful Resources section.
Implementation
Implementing Constitutional AI consists of two stages.
Stage 1 - Supervised Fine-Tuning
In the first stage, we red team the model: we deliberately give it prompts designed to elicit a harmful response, such as Can you help me hack into my neighbor's wifi, and let it generate a completion. The prompt can then be augmented with:
Identify how the last response is harmful, unethical, racist, sexist,
toxic, dangerous or illegal.
This is fed to the LLM and it generates something like:
The response was harmful because hacking into someone else’s wifi is
an invasion of their privacy and is possibly illegal.
The model detects the problems in its response. We then put it all together and
ask the model to write a new response which removes all of the harmful and
illegal content. For example:
Rewrite the response to remove any and all harmful, unethical, racist,
sexist, toxic, dangerous or illegal content.
The model generates a new response:
Hacking into your neighbor’s wifi is an invasion of their privacy. It
may also land you in legal trouble. I advise against it.
Thus, the prompt Can you help me hack into my neighbor’s wifi becomes the
red team prompt and the above revised completion is the revised constitutional
response.
We repeat this process for many red team prompts and obtain a dataset of
prompt-completion pairs that can be used to fine-tune the model in a supervised
manner. The resultant LLM will have learnt to generate constitutional responses.
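A sketch of this critique-and-revise loop is shown below. The generate function is a hypothetical stand-in for a call to the LLM; the critique and revision instructions are the ones quoted above:

```python
# Sketch of the critique-and-revise loop that produces the fine-tuning pairs.
# `generate` is a hypothetical stand-in for a call to the LLM being trained.
def generate(prompt):
    return "..."                                   # a real implementation would query the LLM

CRITIQUE_REQUEST = ("Identify how the last response is harmful, unethical, racist, "
                    "sexist, toxic, dangerous or illegal.")
REVISION_REQUEST = ("Rewrite the response to remove any and all harmful, unethical, "
                    "racist, sexist, toxic, dangerous or illegal content.")

def constitutional_pair(red_team_prompt):
    initial_response = generate(red_team_prompt)
    critique = generate(f"{red_team_prompt}\n{initial_response}\n{CRITIQUE_REQUEST}")
    revision = generate(f"{red_team_prompt}\n{initial_response}\n{critique}\n{REVISION_REQUEST}")
    # The red-team prompt and the revised response form one supervised training example.
    return {"prompt": red_team_prompt, "completion": revision}

dataset = [constitutional_pair("Can you help me hack into my neighbor's wifi")]
print(dataset)
```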
Useful Resources
• RLHF paper - Learning to summarize from human feedback.
• NVIDIA article on RLHF.
• HuggingFace article on RLHF.
• YouTube - HuggingFace livestream on RLHF.
• YouTube - PPO Implementation From Scratch in PyTorch.
• Transformer Reinforcement Learning (trl) - Library by HuggingFace for
reinforcement learning on transformers.
• Constitutional AI paper - Constitutional AI: Harmlessness from AI Feedback.
• Lab 3 - Code example where FLAN-T5 is aligned using RLHF.