GALLM Unit 4 Notes
Biases in Human Feedback (RLHF):
Reinforcement Learning from Human Feedback (RLHF) relies on human annotations to align
LLM behaviour with human preferences. However, biases in feedback can negatively impact
training:
Subjectivity & Cultural Bias: Human annotators may have different opinions based on
personal or cultural perspectives.
Overfitting to Popular Opinions: Models may reinforce mainstream or dominant
views while suppressing minority perspectives.
Inconsistent Feedback: Different annotators might provide conflicting labels, making
training less reliable.
Political & Ethical Bias: Certain responses may be favored due to social or
ideological biases in annotation.
Reinforcement of Stereotypes: If biased annotations are used, the model may learn
and propagate stereotypes.
Mitigation Strategies:
Diverse Annotator Pool: Using annotators from different backgrounds to reduce bias.
Debiasing Algorithms: Implementing fairness-aware training techniques.
Continuous Monitoring: Auditing the model’s outputs for unintended biases and
refining feedback mechanisms.
Reward values:
Response P: R(P)=3.0
Response Q: R(Q)=1.5
Response R: R(R)=−1.0
Reward values are learned by training a model using human feedback. Human evaluators
rank different responses, and the model updates its weights to assign higher scores to more
preferred responses. The ranking loss function optimizes these scores by ensuring the
preferred response gets a higher reward than the non-preferred one.
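As a minimal sketch of such a pairwise ranking loss (a Bradley-Terry style objective of the kind commonly used for reward models), the snippet below uses the reward values listed above; the scalar scores stand in for the outputs of a learned reward model, and the function name is just illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ranking_loss(r_preferred, r_rejected):
    """Pairwise ranking loss: -log sigma(r_preferred - r_rejected).
    It is small when the preferred response already scores higher."""
    return -np.log(sigmoid(r_preferred - r_rejected))

# Reward values from the notes: R(P)=3.0, R(Q)=1.5, R(R)=-1.0
print(ranking_loss(3.0, 1.5))    # P preferred over Q -> low loss (~0.20)
print(ranking_loss(1.5, -1.0))   # Q preferred over R -> low loss (~0.08)
print(ranking_loss(-1.0, 3.0))   # if R were (wrongly) preferred over P -> high loss (~4.02)

Minimizing this loss pushes the reward model to assign higher scores to human-preferred responses, which is exactly how the reward values above are learned.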
Fine-Tuning of LLMs using Reinforcement Learning
Fine-tuning LLMs with Reinforcement Learning from Human Feedback (RLHF) involves an
iterative process where a reward model evaluates the quality of the model’s responses and
provides feedback for further optimization. The goal is to align the model’s outputs with
human preferences by using reinforcement learning techniques.
The given image illustrates this process using a step-by-step improvement of the response to
the prompt "A dog is..." over multiple iterations:
The model learns and refines its output: "A friendly animal."
The reward model assigns a higher reward: 0.51
The RL algorithm updates the model again.
The model eventually learns to produce the ideal response: "Man’s best friend."
The reward model assigns a significantly higher reward: 2.87
The LLM is now well-aligned with human preferences and is considered optimized.
Fine-tuning with RLHF ensures that LLMs generate responses that better match human
expectations over multiple iterations. The reward model evaluates outputs, and the RL
algorithm fine-tunes the LLM to maximize the reward, leading to an improved and human-
aligned model.
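As a purely illustrative sketch of this iterative loop, the toy "policy" below is just a probability over three canned completions of "A dog is..."; the rewards 0.51 and 2.87 mirror the example above, the -1.0 score for a poor completion is an assumed value, and the exponentiated-reward update is a crude stand-in for a real PPO step:

import numpy as np

# Toy illustration of the RLHF loop from the "A dog is..." example.
completions = ["A kind of fish.", "A friendly animal.", "Man's best friend."]
rewards = np.array([-1.0, 0.51, 2.87])   # -1.0 is an assumed score for a poor completion
policy = np.ones(3) / 3                  # start from a uniform policy
learning_rate = 1.0

for step in range(3):
    expected_reward = policy @ rewards
    # shift probability mass toward higher-reward completions (stand-in for PPO)
    policy = policy * np.exp(learning_rate * (rewards - expected_reward))
    policy /= policy.sum()
    print(f"step {step}: expected reward = {expected_reward:.2f}, "
          f"P('Man's best friend.') = {policy[2]:.2f}")

Over the iterations the expected reward rises and the policy concentrates on the highest-reward completion, mirroring how the reward model and RL algorithm steer the LLM toward human-preferred outputs.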
Compute the probability that response P is preferred over Q and Q over R using the sigmoid
function, and analyze how the reward values are learned and optimized during the training
process to enhance model performance.
Response A: R(A)=3.1
Response B: R(B)=1.8
Response C: R(C)=0.2
Probability of A over B: P(A preferred over B) = σ(R(A) − R(B)) = σ(3.1 − 1.8) = σ(1.3) ≈ 0.79
Probability of B over C: P(B preferred over C) = σ(R(B) − R(C)) = σ(1.8 − 0.2) = σ(1.6) ≈ 0.83
OR
Response P: R(P)=2.9
Response Q: R(Q)=1.5
Response R: R(R)=−0.3
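A small sketch of the same computation in code, using the alternative P/Q/R reward values above (the first A/B/C variant is worked by hand earlier; swapping in those values works the same way):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Reward values from the alternative variant above
R_P, R_Q, R_R = 2.9, 1.5, -0.3

print(f"P(P preferred over Q) = {sigmoid(R_P - R_Q):.3f}")  # sigma(1.4) ~= 0.802
print(f"P(Q preferred over R) = {sigmoid(R_Q - R_R):.3f}")  # sigma(1.8) ~= 0.858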
1. Aligning AI Models with Human Values (HHH Framework)
AI models, particularly large language models (LLMs), are trained on massive datasets
sourced from the internet. While this helps them generate human-like text, it also introduces
risks such as toxic language, biases, and unsafe recommendations.
To mitigate these risks, fine-tuning and human feedback alignment are used. The HHH
Framework—which stands for Helpful, Honest, and Harmless—guides developers to
ensure AI models behave ethically and safely.
1. Helpful:
AI should provide accurate and useful responses aligned with user intent.
Example:
User: "What's the fastest way to the airport?"
AI: "Taking the highway is usually faster than local roads."
2. Honest:
AI should give truthful information and acknowledge uncertainty rather than fabricate answers.
3. Harmless:
AI should avoid toxic, biased, or unsafe content and refuse requests that could cause harm.
Why PPO?
Stable & Reliable: Balances between exploration and exploitation.
Optimized Updates: Uses a clipping mechanism to prevent large policy changes.
Sample Efficient: Learns effectively from fewer training examples.
Used in LLMs: OpenAI uses PPO in training models like ChatGPT.
1. Agent-Environment Interaction
The AI (Agent) interacts with the environment (e.g., chatbot responding to
users).
It takes actions and gets rewards based on the feedback.
2. Policy Update
The AI updates its policy using feedback.
Uses a clipped objective to bound each update and prevent large jumps in learning (see the sketch after this list).
3. Training & Optimization
PPO trains the model iteratively using rewards.
The goal is to maximize cumulative rewards.
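A minimal sketch of PPO's clipped surrogate objective referred to above (generic PPO math, not tied to any particular library; the probability ratios and advantages are made-up numbers for illustration):

import numpy as np

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Policy changes that move the ratio far from 1 get no extra credit."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# ratio = pi_new(a|s) / pi_old(a|s); advantage = how much better the action was than expected
ratios = np.array([0.9, 1.0, 1.5, 2.0])      # illustrative values
advantages = np.array([1.0, 1.0, 1.0, 1.0])
print(ppo_clipped_objective(ratios, advantages))  # [0.9, 1.0, 1.2, 1.2] -> updates are capped

The clipping is what keeps each policy update small and stable, which is why PPO is favoured for fine-tuning LLMs.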
KL-Divergence, or Kullback-Leibler Divergence, is a concept often encountered in the field
of reinforcement learning, particularly when using the Proximal Policy Optimization (PPO)
algorithm. It is a mathematical measure of the difference between two probability
distributions, which helps us understand how one distribution differs from another. In the
context of PPO, KL-Divergence plays a crucial role in guiding the optimization process to
ensure that the updated policy does not deviate too much from the original policy.
In PPO, the goal is to find an improved policy for an agent by iteratively updating its
parameters based on the rewards received from interacting with the environment. However,
updating the policy too aggressively can lead to unstable learning or drastic policy changes.
To address this, PPO introduces a constraint that limits the extent of policy updates. This
constraint is enforced by using KL-Divergence.
To understand how KL-Divergence works, imagine we have two probability distributions: the
distribution of the original LLM, and a new proposed distribution of an RL-updated LLM.
KL-Divergence measures the average amount of extra information needed when the original
policy's distribution is used to encode samples from the new proposed policy. By penalizing the
KL-Divergence between the two distributions, PPO ensures that the updated policy stays close
to the original policy, preventing drastic changes that may negatively impact the learning process.
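A small sketch of computing KL-Divergence between two next-token distributions (the three-token vocabulary and probabilities are toy values chosen only for illustration):

import numpy as np
from scipy.special import rel_entr

# Toy next-token distributions over the same small vocabulary
p_original = np.array([0.5, 0.3, 0.2])    # original (reference) LLM
p_updated  = np.array([0.4, 0.35, 0.25])  # RL-updated LLM

# KL(updated || original): how far the updated policy has drifted
kl = np.sum(rel_entr(p_updated, p_original))
print(f"KL divergence: {kl:.4f}")  # small value -> the update stayed close to the reference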
A library that you can use to train transformer language models with reinforcement learning,
using techniques such as PPO, is TRL (Transformer Reinforcement Learning). In this link
you can read more about this library, and its integration with PEFT (Parameter-Efficient
Fine-Tuning) methods, such as LoRA (Low-Rank Adaptation). The image shows an overview
of the PPO training setup in TRL.
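As a rough sketch of what one PPO step looks like in TRL, the example below follows the classic TRL 0.x PPOTrainer quickstart pattern; class and method names may differ in newer TRL releases, and "gpt2" plus the hard-coded reward are placeholders, not part of the notes:

import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

model_name = "gpt2"  # small placeholder model for illustration
config = PPOConfig(model_name=model_name)

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen copy for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step on a single prompt, with a hand-assigned reward standing in
# for the reward model's score.
query = tokenizer.encode("A dog is", return_tensors="pt").squeeze()
response = ppo_trainer.generate(query, max_new_tokens=8, return_prompt=False)
reward = torch.tensor(2.87)

stats = ppo_trainer.step([query], [response.squeeze()], [reward])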
Reward hacking in RLHF
Reward hacking in RLHF occurs when an agent exploits flaws in the reward function to
maximize rewards in unintended ways. Instead of genuinely improving the model’s
performance, the agent finds a way to game the system by optimizing for the reward signal
without aligning with the actual goal.
Example:
Explanation:
1. The model learns to optimize its outputs for the toxicity reward model, rather than
providing balanced and honest feedback.
2. This demonstrates reward hacking, where the model generates excessively positive
and unrealistic responses to maximize the reward.
3. The agent is misaligned with human intent, as the goal is to generate useful product
descriptions, not overly generic, exaggerated praise.
Thus, reward hacking occurs when the model exploits the reward function instead of truly
improving its responses.
This diagram explains a reward hacking issue in Reinforcement Learning from Human
Feedback (RLHF) and how it is controlled using KL Divergence Penalty.
Step-by-step Breakdown:
1. Prompt Dataset: The model is given an input like “This product is...”
2. Reference Model: Two versions of a model generate responses:
One produces a normal, balanced response: “useful and well-priced.”
Another may over-exaggerate the response: “the most awesome, most
incredible thing ever.”
3. Reward Model: The exaggerated response might get a higher score from a reward
model, causing the model to generate over-optimized outputs that sound extreme or
biased.
4. KL Divergence Penalty: A penalty term is applied when the new model deviates too
much from the reference model. This helps prevent excessive drift and reward
hacking (where the model manipulates its outputs to maximize rewards in unintended
ways); a code sketch of this penalty follows the list.
5. PPO (Proximal Policy Optimization): The model is fine-tuned using PPO.
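A minimal sketch of how the KL penalty in step 4 is typically folded into the reward before the PPO update, using per-token log-probability ratios; the numbers and the penalty weight beta are illustrative assumptions, not values from the notes:

import numpy as np

def kl_penalized_reward(reward, logprobs_new, logprobs_ref, beta=0.1):
    """Subtract a KL-style penalty: beta * sum(log pi_new - log pi_ref).
    Responses that drift far from the reference model lose reward."""
    kl_estimate = np.sum(logprobs_new - logprobs_ref)
    return reward - beta * kl_estimate

# Balanced response: stays close to the reference model (small log-ratio)
print(kl_penalized_reward(reward=1.0,
                          logprobs_new=np.array([-1.1, -0.9]),
                          logprobs_ref=np.array([-1.2, -1.0])))   # ~0.98

# Exaggerated response: higher raw reward but large drift from the reference
print(kl_penalized_reward(reward=1.5,
                          logprobs_new=np.array([-0.2, -0.3]),
                          logprobs_ref=np.array([-3.0, -2.5])))   # ~1.00 after the penalty

The penalty erases most of the reward advantage the exaggerated response gained, which is how the pipeline discourages reward hacking.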
__________________________________________________________________
Definition: RLHF is a technique where AI models learn from human preferences instead of
fixed rules.
Why is it used? Because desired behaviour (helpful, honest, harmless responses) is hard to capture
with fixed rules, so the model is trained to match human preference rankings instead.
Example: Human raters prefer a balanced product description such as "This product is useful and
well-priced." over an exaggerated one, and the model is rewarded for producing the preferred style.
Problem: If the AI learns that extreme statements get the highest reward, it may over-
exaggerate responses. This is called reward hacking.
Definition: Reward hacking occurs when an AI model manipulates the reward system by
generating responses that maximize rewards without truly improving quality.
Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune Large Language
Models (LLMs) by training them based on human preferences. The process involves collecting
human preference rankings, training a reward model on those rankings, and optimizing the LLM
using PPO (Proximal Policy Optimization) to improve responses while staying close to the
original model.
Reward Hacking happens when the model exploits the reward function in unintended
ways, leading to responses that maximize the reward but deviate significantly from the desired
outputs.
Example:
The model should generate a fair review like “This product is useful and well-
priced.”
But if exaggeration gets a higher reward, the model might start saying:
“This is the best product in the universe!!!”
This is not actually helpful but is reward-maximizing behaviour (reward hacking).
If the reward favors lengthy completions, the model might generate overly long,
redundant responses instead of concise and relevant ones.
Ref: AWS
What is KL Divergence?
Kullback-Leibler (KL) Divergence measures how much the new model's output
distribution differs from that of the original reference model. If the new model changes too
much, it is penalized. This ensures the model doesn't drift too far from the original while still
improving based on feedback.
import numpy as np
from scipy.special import rel_entr


class RLHFTrainer:
    def __init__(self, lambda_kl=0.1):
        """Initialize reference responses, new (RL-updated) responses, and the KL penalty weight."""
        self.lambda_kl = lambda_kl
        self.reference_responses = {
            "product": ["useful and well-priced.", "affordable and reliable."],
            "movie": ["exciting and well-directed.", "a fantastic watch!"],
            "food": ["delicious and well-cooked.", "tasty and satisfying."],
        }
        self.new_responses = {
            "product": ["the most awesome, most incredible thing ever!"],
            "movie": ["an absolute masterpiece that will change your life!"],
            "food": ["the best dish you'll ever eat, a divine experience!"],
        }

    def _word_distribution(self, responses, vocab):
        """Turn a list of responses into a smoothed word-frequency distribution over a shared vocabulary."""
        words = " ".join(responses).lower().split()
        counts = np.array([words.count(w) + 1e-6 for w in vocab])  # smoothing avoids zero probabilities
        return counts / counts.sum()

    def compute_reward(self, category):
        """Toy reward used by train_model. The original notes call this method without defining it;
        this is an assumed illustrative implementation: a fixed base score minus a KL penalty
        for drifting away from the reference responses."""
        ref = self.reference_responses[category]
        new = self.new_responses[category]
        vocab = sorted(set(" ".join(ref + new).lower().split()))
        kl_penalty = float(np.sum(rel_entr(self._word_distribution(new, vocab),
                                           self._word_distribution(ref, vocab))))
        base_reward = 3.0  # placeholder for a reward model's score
        return base_reward - self.lambda_kl * kl_penalty, kl_penalty

    def train_model(self):
        """Simulated RLHF training loop."""
        print("Training Model with PPO & KL Penalty...\n")
        for category in self.reference_responses:
            reward, kl_penalty = self.compute_reward(category)
            print(f"Category: {category.upper()}")
            print(f"KL Divergence Penalty: {kl_penalty:.4f}")
            print(f"Final Reward: {reward:.4f}\n")


if __name__ == "__main__":
    RLHFTrainer().train_model()
Takeaways
KL divergence prevents extreme model drift while still improving outputs.
PPO optimizes model updates in a controlled way.
Human feedback & penalties work together for better alignment.
Using PEFT (Parameter-Efficient Fine-Tuning) reduces memory requirements in large
models.
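For the PEFT point above, a minimal sketch of attaching LoRA adapters with the Hugging Face peft library; the base model name and hyperparameter values are illustrative choices, not values from the notes:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

# Only the small low-rank adapter matrices are trained, so memory use
# during fine-tuning drops sharply compared with full fine-tuning.
lora_config = LoraConfig(
    r=8,               # rank of the low-rank update matrices
    lora_alpha=16,     # scaling factor for the update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the full model's parameters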