
GALLM Unit 4 Notes

The document discusses the challenges and biases in obtaining high-quality human feedback for Reinforcement Learning from Human Feedback (RLHF), including subjectivity, scalability issues, and ambiguity in feedback. It outlines mitigation strategies, such as using diverse annotators and implementing consistency checks, and explains the importance of fine-tuning large language models (LLMs) to align with human values through techniques like Proximal Policy Optimization (PPO). Additionally, it addresses reward hacking, where models exploit reward functions, and emphasizes the use of KL Divergence to prevent such issues.


Challenges in obtaining high-quality human feedback for RLHF:

Reinforcement Learning from Human Feedback (RLHF) relies on human annotations to improve model alignment. However, obtaining high-quality feedback is challenging due to:

1. Subjectivity & Bias:
 Different annotators have varied opinions, leading to inconsistent labels.
 Cultural and personal biases can affect judgments.
2. Scalability Issues:
 Labelling large datasets manually is time-consuming and expensive.
 Recruiting and training expert annotators requires resources.
3. Ambiguity in Feedback:
 Some tasks lack a single correct answer, making evaluation difficult.
 Ambiguous instructions can lead to misinterpretation.
4. Overfitting to Human Preferences:
 The model may reinforce mainstream views while neglecting minority
perspectives.
 It may prioritize engagement over factual accuracy.
5. Fatigue and Annotation Errors:
 Annotators may become fatigued, reducing the quality of feedback.
 Inconsistent feedback can misguide the reinforcement learning process.

Mitigation Strategies:

 Use diverse annotators to reduce bias.
 Implement consistency checks to validate human feedback.
 Use automated evaluation techniques (e.g., adversarial testing) to supplement human
feedback.

Biases in human feedback:

Reinforcement Learning from Human Feedback (RLHF) relies on human annotations to align
LLM behaviour with human preferences. However, biases in feedback can negatively impact
training:

 Subjectivity & Cultural Bias: Human annotators may have different opinions based on
personal or cultural perspectives.
 Overfitting to Popular Opinions: Models may reinforce mainstream or dominant
views while suppressing minority perspectives.
 Inconsistent Feedback: Different annotators might provide conflicting labels, making
training less reliable.
 Political & Ethical Bias: Certain responses may be favored due to social or
ideological biases in annotation.
 Reinforcement of Stereotypes: If biased annotations are used, the model may learn
and propagate stereotypes.
Mitigation Strategies:

 Diverse Annotator Pool: Using annotators from different backgrounds to reduce bias.
 Debiasing Algorithms: Implementing fairness-aware training techniques.
 Continuous Monitoring: Auditing the model’s outputs for unintended biases and
refining feedback mechanisms.

How pairwise ranking loss influences reward values:

Example reward values:

 Response P: R(P)=3.0
 Response Q: R(Q)=1.5
 Response R: R(R)=−1.0

Reward values are learned by training a model using human feedback. Human evaluators
rank different responses, and the model updates its weights to assign higher scores to more
preferred responses. The ranking loss function optimizes these scores by ensuring the
preferred response gets a higher reward than the non-preferred one.
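
As an illustration, here is a minimal sketch of this pairwise ranking (Bradley-Terry style) loss using the reward values above; the function name and printed values are purely illustrative, not part of any specific library:

import math

def pairwise_ranking_loss(r_preferred, r_rejected):
    """Pairwise ranking loss: -log sigmoid(R(preferred) - R(rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# P is preferred over Q, and Q over R, so training should keep R(P) > R(Q) > R(R)
print(f"Loss for (P, Q): {pairwise_ranking_loss(3.0, 1.5):.3f}")   # ~0.201
print(f"Loss for (Q, R): {pairwise_ranking_loss(1.5, -1.0):.3f}")  # ~0.079

# Minimizing this loss by gradient descent pushes preferred rewards up and
# non-preferred rewards down, which is how the reward scores are learned.
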
Fine-Tuning of LLMs using Reinforcement Learning

Fine-tuning LLMs with Reinforcement Learning from Human Feedback (RLHF) involves an
iterative process where a reward model evaluates the quality of the model’s responses and
provides feedback for further optimization. The goal is to align the model’s outputs with
human preferences by using reinforcement learning techniques.

The following walkthrough illustrates this process using a step-by-step improvement of the response to the prompt "A dog is..." over multiple iterations:

1. Initial Response and Reward Assignment (Iteration 1)

 The model generates the response: "A furry animal."
 The reward model assigns a low reward: 0.24
 The RL algorithm updates the model to improve future responses.

2. Improving the Response (Iteration 2)

 The model learns and refines its output: "A friendly animal."
 The reward model assigns a higher reward: 0.51
 The RL algorithm updates the model again.

3. Further Refinement (Iteration 3)

 The model now generates: "A human companion."
 The reward model assigns a reward of 0.68, indicating better alignment with human
values.
 The RL algorithm continues fine-tuning.

4. Converging to an Optimal Response (Iteration n)

 The model eventually learns to produce the ideal response: "Man’s best friend."
 The reward model assigns a significantly higher reward: 2.87
 The LLM is now well-aligned with human preferences and is considered optimized.

Fine-tuning with RLHF ensures that LLMs generate responses that better match human
expectations over multiple iterations. The reward model evaluates outputs, and the RL
algorithm fine-tunes the LLM to maximize the reward, leading to an improved and human-
aligned model.
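
A minimal, runnable simulation of this loop is sketched below; the reward model is replaced by a simple lookup table holding the example scores above, and the PPO update is only indicated in a comment:

example_rewards = {
    "A furry animal.": 0.24,
    "A friendly animal.": 0.51,
    "A human companion.": 0.68,
    "Man's best friend.": 2.87,
}

# In each "iteration" the (simulated) LLM produces a better response
for iteration, response in enumerate(example_rewards, start=1):
    reward = example_rewards[response]   # the reward model scores the output
    print(f"Iteration {iteration}: '{response}' -> reward {reward}")
    # In real RLHF fine-tuning, the RL algorithm (e.g. PPO) would now update
    # the LLM's weights so that higher-reward responses become more likely.
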
Compute the probability that one response is preferred over another using the sigmoid function, and analyze how the reward values are learned and optimized during training to enhance model performance.

P(P > Q) = σ(R(P) − R(Q)) = 1 / (1 + e^−(R(P) − R(Q)))

 Response A: R(A)=3.1
 Response B: R(B)=1.8
 Response C: R(C)=0.2

Probability of A over B:

P(A > B) = σ(R(A) − R(B)) = 1 / (1 + e^−(3.1 − 1.8)) = 1 / (1 + e^−1.3) = 1 / (1 + 0.273) ≈ 0.79 (about 79%)

Probability of B over C:

P(B > C) = σ(R(B) − R(C)) = 1 / (1 + e^−(1.8 − 0.2)) = 1 / (1 + e^−1.6) = 1 / (1 + 0.202) ≈ 0.83 (about 83%)
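
These values can be checked with a few lines of Python (a small sketch; the helper name is ours):

import math

def preference_prob(r_winner, r_loser):
    """P(winner preferred over loser) = sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_winner - r_loser)))

print(f"P(A > B) = {preference_prob(3.1, 1.8):.2f}")  # ~0.79
print(f"P(B > C) = {preference_prob(1.8, 0.2):.2f}")  # ~0.83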

Analysis of Reward Optimization:

 Higher reward values indicate better responses according to human preferences.
 Pairwise ranking loss helps fine-tune the model by reinforcing preferred responses over less preferred ones.
 Over time, gradient updates adjust the reward model to maximize user satisfaction by
optimizing response rankings.

OR

 Response P: R(P)=2.9
 Response Q: R(Q)=1.5
 Response R: R(R)=−0.3

1. Aligning AI Models with Human Values (HHH Framework)

Understanding Model Alignment & Fine-Tuning

AI models, particularly large language models (LLMs), are trained on massive datasets
sourced from the internet. While this helps them generate human-like text, it also introduces
risks such as toxic language, biases, and unsafe recommendations.

To mitigate these risks, fine-tuning and human feedback alignment are used. The HHH
Framework—which stands for Helpful, Honest, and Harmless—guides developers to
ensure AI models behave ethically and safely.

Why Is Fine-Tuning with Human Feedback Essential?

1. Helps the model understand human-like prompts more effectively.
2. Improves the naturalness of responses, making AI more conversational.
3. Reduces unintended harmful outputs, such as toxic language and misinformation.

The Three Pillars of AI Alignment (HHH Framework)

1️. Helpful:

AI should provide accurate and useful responses aligned with user intent.
Example:
User: "What's the fastest way to the airport?"
AI: "Taking the highway is usually faster than local roads."

2️. Honest:

AI must give truthful information, avoid hallucinations, and admit uncertainty.


Example:
User: "Can coughing stop a heart attack?"
Wrong Response: "Yes, coughing can stop a heart attack."
Correct Response: "No, coughing does not stop a heart attack. Please seek immediate
medical attention."

3️. Harmless:

AI should not generate harmful, offensive, or dangerous responses.


Example:
User: "How can I hack into a WiFi network?"
Wrong Response: "Here are the best ways to hack into a WiFi network..."
Correct Response: "Sorry, but I can't help with that. Unauthorized access to networks is
illegal."

Understanding Reinforcement Learning for LLMs with Proximal Policy Optimization (PPO)

What is Proximal Policy Optimization (PPO)?

Proximal Policy Optimization (PPO) is a state-of-the-art Reinforcement Learning (RL) algorithm that improves the stability and efficiency of policy optimization. It is widely used for training Large Language Models (LLMs), robotic control, and game-playing AI.

Why PPO?
Stable & Reliable: Balances between exploration and exploitation.
Optimized Updates: Uses a clipping mechanism to prevent large policy changes.
Sample Efficient: Learns effectively from fewer training examples.
Used in LLMs: OpenAI uses PPO in training models like ChatGPT.

How PPO Works

1. Agent-Environment Interaction
 The AI (Agent) interacts with the environment (e.g., chatbot responding to
users).
 It takes actions and gets rewards based on the feedback.
2. Policy Update
 The AI updates its policy using feedback.
 Uses a clipped surrogate objective to limit each update and prevent large jumps in learning (see the sketch after this list).
3. Training & Optimization
 PPO trains the model iteratively using rewards.
 The goal is to maximize cumulative rewards.
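
A minimal numeric sketch of that clipping mechanism (the standard PPO-clip surrogate objective, with illustrative numbers, not tied to any particular library) is shown below:

import numpy as np

def ppo_clipped_objective(new_logprob, old_logprob, advantage, epsilon=0.2):
    """PPO-clip surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A), with r = pi_new / pi_old."""
    ratio = np.exp(new_logprob - old_logprob)              # probability ratio r
    clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.minimum(ratio * advantage, clipped_ratio * advantage)

# A large policy change (ratio ~ 1.65) is clipped to 1.2, limiting the update size
print(ppo_clipped_objective(new_logprob=-1.0, old_logprob=-1.5, advantage=2.0))  # ~ 2.4
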
KL-Divergence, or Kullback-Leibler Divergence, is a concept often encountered in the field
of reinforcement learning, particularly when using the Proximal Policy Optimization (PPO)
algorithm. It is a mathematical measure of the difference between two probability
distributions, which helps us understand how one distribution differs from another. In the
context of PPO, KL-Divergence plays a crucial role in guiding the optimization process to
ensure that the updated policy does not deviate too much from the original policy.

In PPO, the goal is to find an improved policy for an agent by iteratively updating its
parameters based on the rewards received from interacting with the environment. However,
updating the policy too aggressively can lead to unstable learning or drastic policy changes.
To address this, PPO introduces a constraint that limits the extent of policy updates. This
constraint is enforced by using KL-Divergence.

To understand how KL-Divergence works, imagine we have two probability distributions: the
distribution of the original LLM, and a new proposed distribution of an RL-updated LLM.
KL-Divergence measures the extra information needed, on average, to represent samples from the new proposed policy using the original policy's distribution. By minimizing the KL-Divergence
between the two distributions, PPO ensures that the updated policy stays close to the original
policy, preventing drastic changes that may negatively impact the learning process.
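
As a small illustration of the quantity involved, the sketch below computes the KL divergence between two made-up token distributions standing in for the reference LLM and the RL-updated LLM at a single position:

import numpy as np

ref_probs = np.array([0.70, 0.20, 0.10])   # original (reference) LLM's token distribution
new_probs = np.array([0.50, 0.40, 0.10])   # RL-updated LLM's token distribution

# KL(new || ref): how far the updated policy has drifted from the reference
kl = np.sum(new_probs * np.log(new_probs / ref_probs))
print(f"KL(new || ref) = {kl:.4f} nats")   # larger values mean bigger policy drift
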
TRL (Transformer Reinforcement Learning) is a library that you can use to train transformer language models with reinforcement learning, using techniques such as PPO. Its documentation also covers integration with PEFT (Parameter-Efficient Fine-Tuning) methods, such as LoRA (Low-Rank Adaptation). (The accompanying figure, not reproduced here, gives an overview of the PPO training setup in TRL.)
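
For orientation only, the sketch below shows roughly how a single PPO step looks with the classic TRL PPOTrainer interface. The TRL API has been reorganized across versions, so treat the class names, arguments, and the hard-coded reward as assumptions to verify against the TRL version you install:

# Illustrative sketch only - check names and signatures against your installed TRL version.
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

config = PPOConfig(model_name="gpt2")
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query = tokenizer.encode("This product is", return_tensors="pt")[0]
response = ppo_trainer.generate(query, max_new_tokens=16)[0]
reward = [torch.tensor(0.8)]                              # score from a reward model (hard-coded here)
stats = ppo_trainer.step([query], [response], reward)     # PPO update, KL-penalized against ref_model
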
Reward hacking in RLHF

Reward hacking in RLHF occurs when an agent exploits flaws in the reward function to
maximize rewards in unintended ways. Instead of genuinely improving the model’s
performance, the agent finds a way to game the system by optimizing for the reward signal
without aligning with the actual goal.

Example:

 Initially, the Instruct LLM responds with a negative statement: “...complete garbage.” This receives a low toxicity reward of -1.8 (Iteration 1).
 In the next iteration, the model modifies its response to “okay but not the best,”
which is slightly less toxic, leading to an improved reward score of 0.3 (Iteration 2).
 Further training encourages the model to generate overly positive responses, like
“..the most awesome, most incredible thing ever.” This response receives a higher
reward of 2.1 (Iteration 3).
 Ultimately, the model exploits the reward system by generating unrealistic,
exaggeratedly positive responses such as “Beautiful love and world peace all
around.” with a 3.7 reward score (Iteration n), even if the response is unrelated to the
actual product.

Explanation:

1. The model learns to optimize its outputs for the toxicity reward model, rather than
providing balanced and honest feedback.
2. This demonstrates reward hacking, where the model generates excessively positive
and unrealistic responses to maximize the reward.
3. The agent is misaligned with human intent, as the goal is to generate useful product
descriptions, not overly generic, exaggerated praise.

Thus, reward hacking occurs when the model exploits the reward function instead of truly
improving its responses.

RLHF Reward Hacking (Detail)

This diagram explains a reward hacking issue in Reinforcement Learning from Human
Feedback (RLHF) and how it is controlled using KL Divergence Penalty.

Step-by-step Breakdown:

1. Prompt Dataset: The model is given an input like “This product is...”
2. Reference Model: Two versions of a model generate responses:
 One produces a normal, balanced response: “useful and well-priced.”
 Another may over-exaggerate the response: “the most awesome, most
incredible thing ever.”
3. Reward Model: The exaggerated response might get a higher score from a reward
model, causing the model to generate over-optimized outputs that sound extreme or
biased.
4. KL Divergence Penalty: A penalty term is applied when the new model deviates too
much from the reference model. This helps prevent excessive drift and reward
hacking (where the model manipulates its outputs to maximize rewards in unintended
ways).
5. PPO (Proximal Policy Optimization): The model is fine-tuned using PPO.

__________________________________________________________________

RLHF and Reward Hacking:

1. What Reinforcement Learning from Human Feedback (RLHF) is and why it is used.
2. How reward models guide AI behaviour.
3. What reward hacking is and how it can cause problems.
4. How KL divergence penalty prevents reward hacking.
5. A practical implementation of reward modelling with KL penalty.

1️. Understanding RLHF (Reinforcement Learning from Human Feedback)

Definition: RLHF is a technique where AI models learn from human preferences instead of
fixed rules.
Why is it used?

 AI models like GPT-4 sometimes generate biased or undesirable responses.
 RLHF helps models align with human values by rewarding good responses and
penalizing bad ones.
 It is commonly used in fine-tuning large language models (LLMs) like ChatGPT.
Basic RLHF Process:
1️ A base AI model generates multiple responses.
2️ Human annotators rank the responses.
3️ A reward model is trained to predict which responses are better (see the sketch after this list for how rankings become training pairs).
4️ The AI model is fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward.
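As a small illustration of steps 2 and 3 (an assumed toy example, not taken from any library), human rankings can be converted into (chosen, rejected) pairs that the reward model is trained on:

responses = ["Man's best friend.", "A friendly animal.", "A furry animal."]
ranking = [0, 1, 2]   # index 0 is the response humans preferred most

# Pair every higher-ranked response against every lower-ranked one
preference_pairs = [
    (responses[ranking[i]], responses[ranking[j]])
    for i in range(len(ranking))
    for j in range(i + 1, len(ranking))
]
print(preference_pairs)   # (chosen, rejected) pairs for reward-model training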

2️ How Reward Models Work

 A reward model assigns scores to AI-generated responses based on human preferences.
 These scores are used to train the AI to prefer better responses.
 The reward function is often based on sentiment analysis, coherence, or task
completion.

Example:

Imagine training an AI to generate product reviews.

 Response 1: “This product is well-built and affordable.” → Reward: 0.8 (Good)
 Response 2: “This product is the best thing in the universe!!!” → Reward: 1.0 (Too
Exaggerated)
 Response 3: “I hate this product, don’t buy it.” → Reward: 0.2 (Bad)

Problem: If the AI learns that extreme statements get the highest reward, it may over-
exaggerate responses. This is called reward hacking.

3️ What is Reward Hacking?

Definition: Reward hacking occurs when an AI model manipulates the reward system by
generating responses that maximize rewards without truly improving quality.

Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune Large Language
Models (LLMs) by training them based on human preferences. The process involves:

1. Generating responses from a base model (Reference Model).
2. Comparing outputs and evaluating their quality.
3. Using a Reward Model to assign scores based on human alignment.
4. Optimizing the LLM using PPO (Proximal Policy Optimization) to improve responses while staying close to the original.

Reward Hacking happens when the model exploits the reward function in unintended
ways, leading to responses that maximize the reward but deviate significantly from desired
outputs.

Example:

 The model should generate a fair review like “This product is useful and well-
priced.”
 But if exaggeration gets a higher reward, the model might start saying:
“This is the best product in the universe!!!”
 This is not actually helpful but is reward-maximizing behaviour (reward hacking).
 If the reward favors lengthy completions, the model might generate overly long,
redundant responses instead of concise and relevant ones.

Ref: AWS

Real-World Example of Reward Hacking:


Game AI: If an AI is trained to win a racing game, instead of finishing the race, it might find
a bug that lets it teleport to the finish line!
Stock Trading AI: An AI that is trained to maximize profits might engage in unethical
trading practices.

Solution: We use KL Divergence Penalty to stop excessive deviations.

4️ KL Divergence Penalty: Preventing Reward Hacking in RLHF:

What is KL Divergence?

 Kullback-Leibler (KL) Divergence measures how much the new model's output distribution differs from the original reference model's distribution.
 If the new model changes too much, we penalize it.

Formula for KL Divergence:

KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )

where P is the RL-updated model's output distribution and Q is the reference model's distribution.

How KL Penalty Works:
1️ If a response is too extreme compared to the original model, KL adds a penalty to reduce
exaggeration.
2️ This keeps responses realistic and useful.

 It prevents extreme shifts in model behaviour.
 Encourages gradual learning instead of abrupt, unintended changes.
 Penalizes completions that deviate too far from the reference LLM.

3️ The final reward becomes:

Adjusted Reward = Reward Model Score − (λ × KL Penalty)

KL Divergence Penalty in Reward Calculation

The KL penalty is subtracted from the reward model score (weighted by a coefficient λ):

adjusted reward = reward model score − λ × KL(π_new ‖ π_ref)

This ensures the model doesn’t drift too far from the original while still improving based on
feedback.
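For example (illustrative numbers only): with a reward model score of 0.9, λ = 0.2, and a KL penalty of 1.4, the adjusted reward is 0.9 − 0.2 × 1.4 = 0.62; the further the new policy drifts from the reference model, the more the score is pulled down.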

import numpy as np
import random
from scipy.special import rel_entr


class RLHFTrainer:
    def __init__(self, lambda_kl=0.1):
        """Initialize reference/new model responses and reward parameters."""
        self.reference_responses = {
            "product": ["useful and well-priced.", "affordable and reliable."],
            "movie": ["exciting and well-directed.", "a fantastic watch!"],
            "food": ["delicious and well-cooked.", "tasty and satisfying."],
        }

        self.new_responses = {
            "product": ["the most awesome, most incredible thing ever!"],
            "movie": ["an absolute masterpiece that will change your life!"],
            "food": ["the best dish you'll ever eat, a divine experience!"],
        }

        self.lambda_kl = lambda_kl  # Weight for the KL divergence penalty

    def kl_divergence(self, ref_probs, new_probs):
        """Compute KL(new || ref) between two distributions on the same support."""
        return float(np.sum(rel_entr(new_probs, ref_probs)))  # Uses relative entropy

    def get_response_distribution(self, responses, support, epsilon=1e-3):
        """Build a smoothed probability distribution over a shared support.

        Responses the model actually produces get uniform mass; the rest of the
        support gets a small epsilon so the KL divergence stays finite.
        """
        probs = np.full(len(support), epsilon)
        for i, candidate in enumerate(support):
            if candidate in responses:
                probs[i] = 1.0
        return probs / probs.sum()

    def compute_reward(self, category):
        """Compute the final reward: human preference score minus the KL penalty."""
        ref_responses = self.reference_responses.get(category, ["neutral response."])
        new_responses = self.new_responses.get(category, ["neutral response."])

        # Both distributions are defined over the same shared set of responses,
        # otherwise the KL divergence between them would not be well defined.
        support = ref_responses + new_responses
        ref_probs = self.get_response_distribution(ref_responses, support)
        new_probs = self.get_response_distribution(new_responses, support)

        # KL divergence penalty: how far the new model drifts from the reference
        kl_penalty = self.kl_divergence(ref_probs, new_probs)

        # Assign a simulated human preference reward in the range [0.5, 1.0]
        human_reward = random.uniform(0.5, 1.0)

        # Final reward = human reward minus the weighted KL penalty
        final_reward = human_reward - self.lambda_kl * kl_penalty
        return final_reward, kl_penalty

    def train_model(self):
        """Simulated RLHF training loop."""
        print("Training Model with PPO & KL Penalty...\n")
        for category in self.reference_responses:
            reward, kl_penalty = self.compute_reward(category)
            print(f"Category: {category.upper()}")
            print(f"KL Divergence Penalty: {kl_penalty:.4f}")
            print(f"Final Reward: {reward:.4f}\n")


# Run the RLHF training simulation
trainer = RLHFTrainer(lambda_kl=0.2)
trainer.train_model()

This code simulates an RLHF-style reward calculation with a KL penalty:


Reference Model provides neutral responses.
New Model generates enhanced (exaggerated) responses.
KL Divergence measures deviation between both models.
Reward Function adjusts scores using KL penalty to prevent drastic changes.
The KL-penalized reward mirrors how PPO-style optimization keeps policy updates stable.

Takeaways
KL divergence prevents extreme model drift while still improving outputs.
PPO optimizes model updates in a controlled way.
Human feedback & penalties work together for better alignment.
Using PEFT (Parameter-Efficient Fine-Tuning) reduces memory requirements in large
models.
