RLHF
What it is
RLHF (Reinforcement Learning from Human Feedback) is a training method that aligns language models with human preferences. Human annotators compare model outputs and indicate preferences (e.g., “Response A is better than Response B”). These preferences are used to train a reward model, which then provides the training signal for fine-tuning the language model via reinforcement learning, optimizing it toward human-preferred outputs.
[illustrate: Data collection → human labeling comparisons → reward model training → RL fine-tuning → improved model outputs]
How it works
- Data collection:
  - Generate outputs from the base language model on a diverse set of prompts
  - Collect pairs of outputs; annotators rank them by quality
  - Example: Prompt: “Write a story”, Output_A, Output_B → annotators prefer Output_A
- Reward model training:
  - Train a reward model r(prompt, output) that assigns a scalar score; preferences are modeled as P(A preferred to B) = σ(r_A − r_B)
  - Loss: binary cross-entropy on the preference labels, i.e. the pairwise Bradley-Terry loss (see the sketch after this list)
- RL fine-tuning:
  - Use the reward model's score as the objective function
  - Fine-tune the language model via policy gradient (e.g., PPO)
  - Goal: maximize expected reward while keeping the policy close to the original model (KL penalty; see the objective sketch after the example)
- Output: a model optimized for human preferences (helpful, harmless, honest)
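A minimal sketch of the reward-model step, in PyTorch. The RewardModel below is a hypothetical stand-in (a real reward model is a language-model backbone with a scalar head, scoring tokenized text rather than fixed-size feature vectors), but the pairwise loss is the standard Bradley-Terry / binary cross-entropy formulation on preference pairs:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: a real reward model is an LM backbone plus a scalar head."""
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features):                  # features: (batch, dim)
        return self.score(features).squeeze(-1)   # one scalar reward per example

def preference_loss(model, chosen_feats, rejected_feats):
    """Bradley-Terry loss: P(chosen > rejected) = sigmoid(r_c - r_r), so
    binary cross-entropy with label 1 reduces to -log sigmoid(r_c - r_r)."""
    r_chosen = model(chosen_feats)
    r_rejected = model(rejected_feats)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# One gradient step on a random toy batch of 8 preference pairs.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
preference_loss(model, chosen, rejected).backward()
opt.step()

Minimizing this loss pushes the reward of the chosen output above that of the rejected one, which is all the downstream RL step needs.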
Example
# Data collection
Prompt: "Explain quantum computing"
Output_A: "Quantum computers use qubits..."
Output_B: "Quantum computing is really complex stuff"
Annotator: "Output_A is more informative" → label_A > label_B
# Reward model
r_model(prompt, output_A) = 0.8
r_model(prompt, output_B) = 0.3
# RL fine-tuning
RL objective: max E[r_model(prompt, model(prompt))]
Constraint: keep the KL divergence to the base model small
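The KL-penalized objective above can be sketched in a few lines of PyTorch. The rewards, policy log-probs, and reference (base-model) log-probs are assumed to come from sampled completions; in practice PPO maximizes this quantity via policy-gradient updates rather than by differentiating it directly:

import torch

def rlhf_objective(rewards, policy_logprobs, ref_logprobs, beta=0.1):
    """KL-penalized RLHF objective (to be maximized):
    E[ r_model(prompt, output) - beta * KL(policy || base) ].
    policy_logprobs - ref_logprobs is a per-sample Monte Carlo estimate
    of the KL term when outputs are sampled from the policy."""
    kl_estimate = policy_logprobs - ref_logprobs
    return (rewards - beta * kl_estimate).mean()

# Toy values matching the example: Output_A scores 0.8, Output_B scores 0.3.
rewards = torch.tensor([0.8, 0.3])
policy_lp = torch.tensor([-12.0, -15.0])   # log p(output | prompt), policy
ref_lp = torch.tensor([-13.0, -14.5])      # log p(output | prompt), base model
print(rlhf_objective(rewards, policy_lp, ref_lp))

The beta coefficient trades off reward maximization against drift from the base model; too small and the policy reward-hacks, too large and it barely changes.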
Variants and history
Learning rewards from human preferences was introduced for deep RL by Christiano et al. (2017) and applied at scale to language models by OpenAI in InstructGPT (Ouyang et al., 2022), a fine-tuned GPT-3; the RLHF-trained GPT-3.5 models behind ChatGPT dramatically improved usability. Variants include DPO (Direct Preference Optimization; Rafailov et al., 2023), which eliminates the explicit reward model and RL loop; IPO (Identity Preference Optimization; Azar et al., 2023), a regularized variant of DPO; and Constitutional AI (Bai et al., 2022, Anthropic), which replaces most human feedback with AI feedback guided by a written set of principles. RLHF is crucial for building safe, helpful LLMs.
When to use it
Use RLHF when:
- Human preference alignment matters (safety, helpfulness)
- You have budget for human annotation
- Base language model exists and is functional
- You want to optimize multiple criteria at once (e.g., balancing helpfulness and safety)
- Post-training refinement is necessary
RLHF is expensive (it requires human annotators) and technically complex, but it yields significant improvements in usability, safety, and alignment. DPO offers a cheaper, simpler alternative when high-quality preference data is available; a sketch of its loss follows.
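For comparison, a sketch of the DPO loss (Rafailov et al., 2023), which trains the policy directly on preference pairs with no separate reward model. The four log-probabilities are assumed to be computed for the chosen and rejected responses under the current policy and a frozen reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO: the implicit reward of a response is
    beta * (log pi(y|x) - log pi_ref(y|x)); the loss is binary
    cross-entropy on the implicit reward margin between chosen and rejected."""
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

This is the same Bradley-Terry objective as the reward-model sketch above, with the policy's own log-probability ratios standing in for the learned reward.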