GALLM Unit 4 Notes
Biases in Human Feedback (RLHF):
Reinforcement Learning from Human Feedback (RLHF) relies on human annotations to align
LLM behaviour with human preferences. However, biases in feedback can negatively impact
training:
Subjectivity & Cultural Bias: Human annotators may have different opinions based on
personal or cultural perspectives.
Overfitting to Popular Opinions: Models may reinforce mainstream or dominant
views while suppressing minority perspectives.
Inconsistent Feedback: Different annotators might provide conflicting labels, making
training less reliable.
Political & Ethical Bias: Certain responses may be favored due to social or
ideological biases in annotation.
Reinforcement of Stereotypes: If biased annotations are used, the model may learn
and propagate stereotypes.
Mitigation Strategies:
Diverse Annotator Pool: Using annotators from different backgrounds to reduce bias.
Debiasing Algorithms: Implementing fairness-aware training techniques.
Continuous Monitoring: Auditing the model’s outputs for unintended biases and
refining feedback mechanisms.
Reward values:
Response P: R(P)=3.0
Response Q: R(Q)=1.5
Response R: R(R)=−1.0
Reward values are learned by training a model using human feedback. Human evaluators
rank different responses, and the model updates its weights to assign higher scores to more
preferred responses. The ranking loss function optimizes these scores by ensuring the
preferred response gets a higher reward than the non-preferred one.
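As a minimal sketch of such a pairwise ranking loss (a Bradley-Terry style objective of the kind commonly used for reward models), the snippet below uses the reward values listed above; the scalar scores stand in for the outputs of a learned reward model, and the function name is just illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ranking_loss(r_preferred, r_rejected):
    """Pairwise ranking loss: -log sigma(r_preferred - r_rejected).
    It is small when the preferred response already scores higher."""
    return -np.log(sigmoid(r_preferred - r_rejected))

# Reward values from the notes: R(P)=3.0, R(Q)=1.5, R(R)=-1.0
print(ranking_loss(3.0, 1.5))    # P preferred over Q -> low loss (~0.20)
print(ranking_loss(1.5, -1.0))   # Q preferred over R -> low loss (~0.08)
print(ranking_loss(-1.0, 3.0))   # if R were (wrongly) preferred over P -> high loss (~4.02)

Minimizing this loss pushes the reward model to assign higher scores to human-preferred responses, which is exactly how the reward values above are learned.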
Fine-Tuning of LLMs using Reinforcement Learning
Fine-tuning LLMs with Reinforcement Learning from Human Feedback (RLHF) involves an
iterative process where a reward model evaluates the quality of the model’s responses and
provides feedback for further optimization. The goal is to align the model’s outputs with
human preferences by using reinforcement learning techniques.
The given image illustrates this process using a step-by-step improvement of the response to
the prompt "A dog is..." over multiple iterations:
The model learns and refines its output: "A friendly animal."
The reward model assigns a higher reward: 0.51
The RL algorithm updates the model again.
The model eventually learns to produce the ideal response: "Man’s best friend."
The reward model assigns a significantly higher reward: 2.87
The LLM is now well-aligned with human preferences and is considered optimized.
Fine-tuning with RLHF ensures that LLMs generate responses that better match human
expectations over multiple iterations. The reward model evaluates outputs, and the RL
algorithm fine-tunes the LLM to maximize the reward, leading to an improved and human-
aligned model.
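As a purely illustrative sketch of this iterative loop, the toy "policy" below is just a probability over three canned completions of "A dog is..."; the rewards 0.51 and 2.87 mirror the example above, the -1.0 score for a poor completion is an assumed value, and the exponentiated-reward update is a crude stand-in for a real PPO step:

import numpy as np

# Toy illustration of the RLHF loop from the "A dog is..." example.
completions = ["A kind of fish.", "A friendly animal.", "Man's best friend."]
rewards = np.array([-1.0, 0.51, 2.87])   # -1.0 is an assumed score for a poor completion
policy = np.ones(3) / 3                  # start from a uniform policy
learning_rate = 1.0

for step in range(3):
    expected_reward = policy @ rewards
    # shift probability mass toward higher-reward completions (stand-in for PPO)
    policy = policy * np.exp(learning_rate * (rewards - expected_reward))
    policy /= policy.sum()
    print(f"step {step}: expected reward = {expected_reward:.2f}, "
          f"P('Man's best friend.') = {policy[2]:.2f}")

Over the iterations the expected reward rises and the policy concentrates on the highest-reward completion, mirroring how the reward model and RL algorithm steer the LLM toward human-preferred outputs.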
Compute the probability that response P is preferred over Q and Q over R using the sigmoid
function, and analyze how the reward values are learned and optimized during the training
process to enhance model performance.
Response A: R(A)=3.1
Response B: R(B)=1.8
Response C: R(C)=0.2
Probability of A over B: P(A preferred over B) = σ(R(A) − R(B)) = σ(3.1 − 1.8) = σ(1.3) ≈ 0.79
Probability of B over C: P(B preferred over C) = σ(R(B) − R(C)) = σ(1.8 − 0.2) = σ(1.6) ≈ 0.83
OR
Response P: R(P)=2.9
Response Q: R(Q)=1.5
Response R: R(R)=−0.3
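A small sketch of the same computation in code, using the alternative P/Q/R reward values above (the first A/B/C variant is worked by hand earlier; swapping in those values works the same way):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Reward values from the alternative variant above
R_P, R_Q, R_R = 2.9, 1.5, -0.3

print(f"P(P preferred over Q) = {sigmoid(R_P - R_Q):.3f}")  # sigma(1.4) ~= 0.802
print(f"P(Q preferred over R) = {sigmoid(R_Q - R_R):.3f}")  # sigma(1.8) ~= 0.858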
1. Aligning AI Models with Human Values (HHH Framework)
AI models, particularly large language models (LLMs), are trained on massive datasets
sourced from the internet. While this helps them generate human-like text, it also introduces
risks such as toxic language, biases, and unsafe recommendations.
To mitigate these risks, fine-tuning and human feedback alignment are used. The HHH
Framework—which stands for Helpful, Honest, and Harmless—guides developers to
ensure AI models behave ethically and safely.
1. Helpful:
AI should provide accurate and useful responses aligned with user intent.
Example:
User: "What's the fastest way to the airport?"
AI: "Taking the highway is usually faster than local roads."
2. Honest:
AI should give truthful information and acknowledge uncertainty rather than fabricate answers.
3. Harmless:
AI should avoid toxic, biased, or unsafe content and refuse requests that could cause harm.
Why PPO?
Stable & Reliable: Balances between exploration and exploitation.
Optimized Updates: Uses a clipping mechanism to prevent large policy changes.
Sample Efficient: Learns effectively from fewer training examples.
Used in LLMs: OpenAI uses PPO in training models like ChatGPT.
1. Agent-Environment Interaction
The AI (Agent) interacts with the environment (e.g., chatbot responding to
users).
It takes actions and gets rewards based on the feedback.
2. Policy Update
The AI updates its policy using feedback.
Uses a clipped objective to bound each update and prevent large jumps in learning (see the sketch after this list).
3. Training & Optimization
PPO trains the model iteratively using rewards.
The goal is to maximize cumulative rewards.
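A minimal sketch of PPO's clipped surrogate objective referred to above (generic PPO math, not tied to any particular library; the probability ratios and advantages are made-up numbers for illustration):

import numpy as np

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Policy changes that move the ratio far from 1 get no extra credit."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# ratio = pi_new(a|s) / pi_old(a|s); advantage = how much better the action was than expected
ratios = np.array([0.9, 1.0, 1.5, 2.0])      # illustrative values
advantages = np.array([1.0, 1.0, 1.0, 1.0])
print(ppo_clipped_objective(ratios, advantages))  # [0.9, 1.0, 1.2, 1.2] -> updates are capped

The clipping is what keeps each policy update small and stable, which is why PPO is favoured for fine-tuning LLMs.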
KL-Divergence, or Kullback-Leibler Divergence, is a concept often encountered in the field
of reinforcement learning, particularly when using the Proximal Policy Optimization (PPO)
algorithm. It is a mathematical measure of the difference between two probability
distributions, which helps us understand how one distribution differs from another. In the
context of PPO, KL-Divergence plays a crucial role in guiding the optimization process to
ensure that the updated policy does not deviate too much from the original policy.
In PPO, the goal is to find an improved policy for an agent by iteratively updating its
parameters based on the rewards received from interacting with the environment. However,
updating the policy too aggressively can lead to unstable learning or drastic policy changes.
To address this, PPO introduces a constraint that limits the extent of policy updates. This
constraint is enforced by using KL-Divergence.
To understand how KL-Divergence works, imagine we have two probability distributions: the
distribution of the original LLM, and a new proposed distribution of an RL-updated LLM.
KL-Divergence measures the average amount of extra information needed when the original
policy's distribution is used to encode samples from the new proposed policy. By penalizing the
KL-Divergence between the two distributions, PPO ensures that the updated policy stays close
to the original policy, preventing drastic changes that may negatively impact the learning process.
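A small sketch of computing KL-Divergence between two next-token distributions (the three-token vocabulary and probabilities are toy values chosen only for illustration):

import numpy as np
from scipy.special import rel_entr

# Toy next-token distributions over the same small vocabulary
p_original = np.array([0.5, 0.3, 0.2])    # original (reference) LLM
p_updated  = np.array([0.4, 0.35, 0.25])  # RL-updated LLM

# KL(updated || original): how far the updated policy has drifted
kl = np.sum(rel_entr(p_updated, p_original))
print(f"KL divergence: {kl:.4f}")  # small value -> the update stayed close to the reference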
A library that you can use to train transformer language models with reinforcement learning,
using techniques such as PPO, is TRL (Transformer Reinforcement Learning). In this link
you can read more about this library, and its integration with PEFT (Parameter-Efficient
Fine-Tuning) methods, such as LoRA (Low-Rank Adaptation). The image shows an overview
of the PPO training setup in TRL.
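As a rough sketch of what one PPO step looks like in TRL, the example below follows the classic TRL 0.x PPOTrainer quickstart pattern; class and method names may differ in newer TRL releases, and "gpt2" plus the hard-coded reward are placeholders, not part of the notes:

import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

model_name = "gpt2"  # small placeholder model for illustration
config = PPOConfig(model_name=model_name)

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen copy for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step on a single prompt, with a hand-assigned reward standing in
# for the reward model's score.
query = tokenizer.encode("A dog is", return_tensors="pt").squeeze()
response = ppo_trainer.generate(query, max_new_tokens=8, return_prompt=False)
reward = torch.tensor(2.87)

stats = ppo_trainer.step([query], [response.squeeze()], [reward])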
Reward hacking in RLHF
Reward hacking in RLHF occurs when an agent exploits flaws in the reward function to
maximize rewards in unintended ways. Instead of genuinely improving the model’s
performance, the agent finds a way to game the system by optimizing for the reward signal
without aligning with the actual goal.
Example:
Explanation:
1. The model learns to optimize its outputs for the toxicity reward model, rather than
providing balanced and honest feedback.
2. This demonstrates reward hacking, where the model generates excessively positive
and unrealistic responses to maximize the reward.
3. The agent is misaligned with human intent, as the goal is to generate useful product
descriptions, not overly generic, exaggerated praise.
Thus, reward hacking occurs when the model exploits the reward function instead of truly
improving its responses.
This diagram explains a reward hacking issue in Reinforcement Learning from Human
Feedback (RLHF) and how it is controlled using KL Divergence Penalty.
Step-by-step Breakdown:
1. Prompt Dataset: The model is given an input like “This product is...”
2. Reference Model: Two versions of a model generate responses:
One produces a normal, balanced response: “useful and well-priced.”
Another may over-exaggerate the response: “the most awesome, most
incredible thing ever.”
3. Reward Model: The exaggerated response might get a higher score from a reward
model, causing the model to generate over-optimized outputs that sound extreme or
biased.
4. KL Divergence Penalty: A penalty term is applied when the new model deviates too
much from the reference model. This helps prevent excessive drift and reward
hacking (where the model manipulates its outputs to maximize rewards in unintended
ways); a code sketch of this penalty follows the list.
5. PPO (Proximal Policy Optimization): The model is fine-tuned using PPO.
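A minimal sketch of how the KL penalty in step 4 is typically folded into the reward before the PPO update, using per-token log-probability ratios; the numbers and the penalty weight beta are illustrative assumptions, not values from the notes:

import numpy as np

def kl_penalized_reward(reward, logprobs_new, logprobs_ref, beta=0.1):
    """Subtract a KL-style penalty: beta * sum(log pi_new - log pi_ref).
    Responses that drift far from the reference model lose reward."""
    kl_estimate = np.sum(logprobs_new - logprobs_ref)
    return reward - beta * kl_estimate

# Balanced response: stays close to the reference model (small log-ratio)
print(kl_penalized_reward(reward=1.0,
                          logprobs_new=np.array([-1.1, -0.9]),
                          logprobs_ref=np.array([-1.2, -1.0])))   # ~0.98

# Exaggerated response: higher raw reward but large drift from the reference
print(kl_penalized_reward(reward=1.5,
                          logprobs_new=np.array([-0.2, -0.3]),
                          logprobs_ref=np.array([-3.0, -2.5])))   # ~1.00 after the penalty

The penalty erases most of the reward advantage the exaggerated response gained, which is how the pipeline discourages reward hacking.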
__________________________________________________________________
Definition: RLHF is a technique where AI models learn from human preferences instead of
fixed rules.
Why is it used? Because desired behaviour (helpful, honest, harmless responses) is hard to capture
with fixed rules, so the model is trained to match human preference rankings instead.
Example: Human raters prefer a balanced product description such as "This product is useful and
well-priced." over an exaggerated one, and the model is rewarded for producing the preferred style.
Problem: If the AI learns that extreme statements get the highest reward, it may over-
exaggerate responses. This is called reward hacking.
Definition: Reward hacking occurs when an AI model manipulates the reward system by
generating responses that maximize rewards without truly improving quality.
Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune Large Language
Models (LLMs) by training them based on human preferences. The process involves collecting
human preference rankings, training a reward model on those rankings, and optimizing the LLM
using PPO (Proximal Policy Optimization) to improve responses while staying close to the
original model.
Reward Hacking happens when the model exploits the reward function in unintended
ways, leading to responses that maximize the reward but deviate significantly from the desired
outputs.
Example:
The model should generate a fair review like “This product is useful and well-
priced.”
But if exaggeration gets a higher reward, the model might start saying:
“This is the best product in the universe!!!”
This is not actually helpful but is reward-maximizing behaviour (reward hacking).
If the reward favors lengthy completions, the model might generate overly long,
redundant responses instead of concise and relevant ones.
Ref: AWS
What is KL Divergence?
Kullback-Leibler (KL) Divergence measures how much the new model's output
distribution differs from that of the original reference model. If the new model changes too
much, it is penalized. This ensures the model doesn't drift too far from the original while still
improving based on feedback.
import numpy as np
from scipy.special import rel_entr


class RLHFTrainer:
    def __init__(self, lambda_kl=0.1):
        """Initialize reference responses, new (RL-updated) responses, and the KL penalty weight."""
        self.lambda_kl = lambda_kl
        self.reference_responses = {
            "product": ["useful and well-priced.", "affordable and reliable."],
            "movie": ["exciting and well-directed.", "a fantastic watch!"],
            "food": ["delicious and well-cooked.", "tasty and satisfying."],
        }
        self.new_responses = {
            "product": ["the most awesome, most incredible thing ever!"],
            "movie": ["an absolute masterpiece that will change your life!"],
            "food": ["the best dish you'll ever eat, a divine experience!"],
        }

    def _word_distribution(self, responses, vocab):
        """Turn a list of responses into a smoothed word-frequency distribution over a shared vocabulary."""
        words = " ".join(responses).lower().split()
        counts = np.array([words.count(w) + 1e-6 for w in vocab])  # smoothing avoids zero probabilities
        return counts / counts.sum()

    def compute_reward(self, category):
        """Toy reward used by train_model. The original notes call this method without defining it;
        this is an assumed illustrative implementation: a fixed base score minus a KL penalty
        for drifting away from the reference responses."""
        ref = self.reference_responses[category]
        new = self.new_responses[category]
        vocab = sorted(set(" ".join(ref + new).lower().split()))
        kl_penalty = float(np.sum(rel_entr(self._word_distribution(new, vocab),
                                           self._word_distribution(ref, vocab))))
        base_reward = 3.0  # placeholder for a reward model's score
        return base_reward - self.lambda_kl * kl_penalty, kl_penalty

    def train_model(self):
        """Simulated RLHF training loop."""
        print("Training Model with PPO & KL Penalty...\n")
        for category in self.reference_responses:
            reward, kl_penalty = self.compute_reward(category)
            print(f"Category: {category.upper()}")
            print(f"KL Divergence Penalty: {kl_penalty:.4f}")
            print(f"Final Reward: {reward:.4f}\n")


if __name__ == "__main__":
    RLHFTrainer().train_model()
Takeaways
KL divergence prevents extreme model drift while still improving outputs.
PPO optimizes model updates in a controlled way.
Human feedback & penalties work together for better alignment.
Using PEFT (Parameter-Efficient Fine-Tuning) reduces memory requirements in large
models.
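For the PEFT point above, a minimal sketch of attaching LoRA adapters with the Hugging Face peft library; the base model name and hyperparameter values are illustrative choices, not values from the notes:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

# Only the small low-rank adapter matrices are trained, so memory use
# during fine-tuning drops sharply compared with full fine-tuning.
lora_config = LoraConfig(
    r=8,               # rank of the low-rank update matrices
    lora_alpha=16,     # scaling factor for the update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the full model's parameters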