RLHF

What it is

RLHF (Reinforcement Learning from Human Feedback) is a training method that aligns language models with human preferences. Human annotators compare model outputs and indicate preferences (e.g., “Response A is better than Response B”). These preferences are used to train a reward model, which then serves as the objective for fine-tuning the language model via reinforcement learning, optimizing it toward human-preferred outputs.

[illustrate: Data collection → human labeling comparisons → reward model training → RL fine-tuning → improved model outputs]

How it works

  1. Data collection:

    • Generate outputs from base language model on diverse prompts
    • Collect pairs of outputs; annotators rank by quality
    • Example: Prompt: “Write a story”, Output_A, Output_B → Annotators prefer Output_A
  2. Reward model training:

    • Train a model to score outputs by predicted human preference: r(prompt, output) → scalar, typically normalized to [0, 1]
    • Loss: pairwise cross-entropy on preference labels (Bradley–Terry): push r(prompt, preferred) above r(prompt, rejected); see the sketch after this list
  3. RL fine-tuning:

    • Use reward model as objective function
    • Fine-tune language model via policy gradient (e.g., PPO)
    • Goal: Maximize expected reward while staying close to original model
  4. Output: Model optimized for human preferences (helpful, harmless, honest)
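
A minimal sketch of the reward-model loss from step 2, assuming PyTorch; the function name and toy scores are illustrative, not from a specific library:

import torch
import torch.nn.functional as F

def preference_loss(score_preferred, score_rejected):
    # Pairwise (Bradley-Terry) loss: -log sigmoid(r_preferred - r_rejected).
    # Minimizing it pushes the preferred output's score above the rejected
    # one's; this is the binary cross-entropy of the preference label.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage with made-up scores for a batch of two comparisons:
score_a = torch.tensor([0.8, 1.2])   # scores of preferred outputs
score_b = torch.tensor([0.3, -0.5])  # scores of rejected outputs
print(preference_loss(score_a, score_b).item())  # smaller when preferred outputs score higher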

Example

# Data collection
Prompt: "Explain quantum computing"
Output_A: "Quantum computers use qubits..."
Output_B: "Quantum computing is really complex stuff"

Annotator: "Output_A is more informative" → label_A > label_B

# Reward model
r_model(prompt, output_A) = 0.8
r_model(prompt, output_B) = 0.3

# RL fine-tuning
RL objective: max E[r_model(prompt, model(prompt))]
Constraint: keep the KL divergence from the base model small (in practice, a penalty term β · KL(model ‖ base model))
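
A minimal sketch of this KL-penalized objective, assuming sequence-level log-probabilities are available from the fine-tuned policy and the frozen base model; the function name, β value, and toy numbers are illustrative:

import torch

def rlhf_objective(reward, logp_policy, logp_base, beta=0.1):
    # KL-penalized objective: E[r_model(prompt, output)] minus beta times a
    # per-sample estimate of KL(policy || base), computed from log-probs of
    # the sampled outputs. PPO maximizes this; here we only evaluate it.
    kl_estimate = logp_policy - logp_base
    return (reward - beta * kl_estimate).mean()

# Toy usage: rewards from the reward model, log-probs of the sampled outputs
# under the fine-tuned policy and the frozen base model (values made up):
reward = torch.tensor([0.8, 0.3])
logp_policy = torch.tensor([-12.0, -15.0])
logp_base = torch.tensor([-12.5, -14.0])
print(rlhf_objective(reward, logp_policy, logp_base).item())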

Variants and history

RLHF was introduced for deep RL by Christiano et al. (2017) and applied at scale to language models by OpenAI (Ouyang et al., 2022), producing InstructGPT; the same recipe underlies ChatGPT and dramatically improved usability. Variants include DPO (Direct Preference Optimization, which eliminates the separate reward model), IPO (Identity Preference Optimization, a DPO-style objective with a different loss), and Anthropic's Constitutional AI, which replaces much of the human feedback with AI feedback guided by a written set of principles. RLHF is central to building safe, helpful LLMs.
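
For contrast, a minimal sketch of the DPO loss (Rafailov et al., 2023), assuming PyTorch; argument names and toy values are illustrative:

import torch
import torch.nn.functional as F

def dpo_loss(logp_policy_pref, logp_policy_rej, logp_ref_pref, logp_ref_rej, beta=0.1):
    # DPO treats beta * (log pi(output) - log pi_ref(output)) as an implicit
    # reward and applies the same pairwise cross-entropy used for reward-model
    # training, so no separate reward model or RL loop is needed.
    reward_pref = beta * (logp_policy_pref - logp_ref_pref)
    reward_rej = beta * (logp_policy_rej - logp_ref_rej)
    return -F.logsigmoid(reward_pref - reward_rej).mean()

# Toy usage with made-up sequence log-probabilities:
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-12.5]), torch.tensor([-14.0])).item())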

When to use it

Use RLHF when:

  • Human preference alignment matters (safety, helpfulness)
  • You have budget for human annotation
  • Base language model exists and is functional
  • You want to balance multiple criteria (e.g., the helpfulness–safety tradeoff)
  • Post-training refinement is necessary

RLHF is expensive (it requires human annotators) and technically complex, but the benefit is significant improvements in usability, safety, and alignment. DPO offers a cheaper alternative if high-quality preference data is available.

See also