RLHF

What it is

RLHF (Reinforcement Learning from Human Feedback) is a training method that aligns language models with human preferences. Human annotators compare model outputs and indicate preferences (e.g., “Response A is better than Response B”). These preferences are used to train a reward model, which then serves as the objective for fine-tuning the language model via reinforcement learning, optimizing it toward human-preferred outputs.

[illustrate: Data collection → human labeling comparisons → reward model training → RL fine-tuning → improved model outputs]

How it works

  1. Data collection:

    • Generate outputs from base language model on diverse prompts
    • Collect pairs of outputs; annotators rank by quality
    • Example: Prompt: “Write a story”, Output_A, Output_B → Annotators prefer Output_A
  2. Reward model training:

    • Train a model to score outputs by predicted human preference: r(prompt, output) → scalar, typically normalized to [0, 1]
    • Loss: pairwise cross-entropy on preference labels (Bradley–Terry): push r(prompt, preferred) above r(prompt, rejected); see the sketch after this list
  3. RL fine-tuning:

    • Use reward model as objective function
    • Fine-tune language model via policy gradient (e.g., PPO)
    • Goal: Maximize expected reward while staying close to original model
  4. Output: Model optimized for human preferences (helpful, harmless, honest)
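
A minimal sketch of the reward-model loss from step 2, assuming PyTorch; the function name and toy scores are illustrative, not from a specific library:

import torch
import torch.nn.functional as F

def preference_loss(score_preferred, score_rejected):
    # Pairwise (Bradley-Terry) loss: -log sigmoid(r_preferred - r_rejected).
    # Minimizing it pushes the preferred output's score above the rejected
    # one's; this is the binary cross-entropy of the preference label.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage with made-up scores for a batch of two comparisons:
score_a = torch.tensor([0.8, 1.2])   # scores of preferred outputs
score_b = torch.tensor([0.3, -0.5])  # scores of rejected outputs
print(preference_loss(score_a, score_b).item())  # smaller when preferred outputs score higher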

Example

# Data collection
Prompt: "Explain quantum computing"
Output_A: "Quantum computers use qubits..."
Output_B: "Quantum computing is really complex stuff"

Annotator: "Output_A is more informative" → label_A > label_B

# Reward model
r_model(prompt, output_A) = 0.8
r_model(prompt, output_B) = 0.3

# RL fine-tuning
RL objective: max E[r_model(prompt, model(prompt))]
Constraint: keep the KL divergence from the base model small (in practice, a penalty term β · KL(model ‖ base model))
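
A minimal sketch of this KL-penalized objective, assuming sequence-level log-probabilities are available from the fine-tuned policy and the frozen base model; the function name, β value, and toy numbers are illustrative:

import torch

def rlhf_objective(reward, logp_policy, logp_base, beta=0.1):
    # KL-penalized objective: E[r_model(prompt, output)] minus beta times a
    # per-sample estimate of KL(policy || base), computed from log-probs of
    # the sampled outputs. PPO maximizes this; here we only evaluate it.
    kl_estimate = logp_policy - logp_base
    return (reward - beta * kl_estimate).mean()

# Toy usage: rewards from the reward model, log-probs of the sampled outputs
# under the fine-tuned policy and the frozen base model (values made up):
reward = torch.tensor([0.8, 0.3])
logp_policy = torch.tensor([-12.0, -15.0])
logp_base = torch.tensor([-12.5, -14.0])
print(rlhf_objective(reward, logp_policy, logp_base).item())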

Variants and history

RLHF was introduced for deep RL by Christiano et al. (2017) and applied at scale to language models by OpenAI (Ouyang et al., 2022), producing InstructGPT; the same recipe underlies ChatGPT and dramatically improved usability. Variants include DPO (Direct Preference Optimization, which eliminates the separate reward model), IPO (Identity Preference Optimization, a DPO-style objective with a different loss), and Anthropic's Constitutional AI, which replaces much of the human feedback with AI feedback guided by a written set of principles. RLHF is central to building safe, helpful LLMs.
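
For contrast, a minimal sketch of the DPO loss (Rafailov et al., 2023), assuming PyTorch; argument names and toy values are illustrative:

import torch
import torch.nn.functional as F

def dpo_loss(logp_policy_pref, logp_policy_rej, logp_ref_pref, logp_ref_rej, beta=0.1):
    # DPO treats beta * (log pi(output) - log pi_ref(output)) as an implicit
    # reward and applies the same pairwise cross-entropy used for reward-model
    # training, so no separate reward model or RL loop is needed.
    reward_pref = beta * (logp_policy_pref - logp_ref_pref)
    reward_rej = beta * (logp_policy_rej - logp_ref_rej)
    return -F.logsigmoid(reward_pref - reward_rej).mean()

# Toy usage with made-up sequence log-probabilities:
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-12.5]), torch.tensor([-14.0])).item())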

When to use it

Use RLHF when:

  • Human preference alignment matters (safety, helpfulness)
  • You have budget for human annotation
  • Base language model exists and is functional
  • You want to balance multiple criteria (e.g., the helpfulness–safety tradeoff)
  • Post-training refinement is necessary

RLHF is expensive (it requires human annotators) and technically complex, but the benefit is significant improvements in usability, safety, and alignment. DPO offers a cheaper alternative if high-quality preference data is available.

See also