Perplexity

What it is

Perplexity (PPL) is an evaluation metric for language models that measures how surprised the model is by a test set. Mathematically, it is the exponentiated negative average log-probability (i.e., the exponentiated cross-entropy) of the test tokens. Lower perplexity indicates better prediction: a model that assigns high probability to the observed tokens has low perplexity.

[illustrate: Language model probability distributions over two test sequences; calculating log-probabilities and averaging; exponentiation to perplexity]

How it works

For a test sequence of N tokens:

Cross-entropy = -(1/N) × Σ log P(t_i | context_i)

Perplexity = exp(cross-entropy) = exp(-(1/N) × Σ log P(t_i | context_i))

Interpretation: Perplexity of k means the model is “equally confused” as if predicting uniformly among k tokens: a uniform distribution over k options assigns each token probability 1/k, giving cross-entropy log k and perplexity exactly k. A model with perplexity 50 on a test set is as uncertain as a random choice among 50 equally likely options.
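For concreteness, here is a minimal Python sketch of these formulas (the function name and inputs are illustrative, not taken from any particular library). It computes perplexity from per-token natural-log probabilities and checks the "uniform among k" interpretation:

import math

def perplexity(log_probs):
    """Perplexity from per-token log-probabilities.

    log_probs: list of log P(t_i | context_i), natural log, one per test token.
    """
    cross_entropy = -sum(log_probs) / len(log_probs)
    return math.exp(cross_entropy)

# Sanity check of the interpretation above: if every token gets probability
# 1/k, cross-entropy is log(k) and perplexity is exactly k.
k = 50
uniform_log_probs = [math.log(1.0 / k)] * 100
assert abs(perplexity(uniform_log_probs) - k) < 1e-9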

Example

# Test sequence: "the quick brown fox"
# Language model probabilities:
P("the" | START) = 0.3       → log P = -1.20
P("quick" | "the") = 0.4     → log P = -0.92
P("brown" | "the quick") = 0.5 → log P = -0.69
P("fox" | "the quick brown") = 0.6 → log P = -0.51

Cross-entropy = -(1/4) × (-1.20 - 0.92 - 0.69 - 0.51) = 0.83
Perplexity = exp(0.83) ≈ 2.30

# Model is as uncertain, on average, as choosing among ~2.3 equally likely tokens
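The same numbers can be reproduced with a few lines of Python using only the standard library:

import math

# Per-token probabilities from the example above
probs = [0.3, 0.4, 0.5, 0.6]
log_probs = [math.log(p) for p in probs]          # [-1.20, -0.92, -0.69, -0.51]

cross_entropy = -sum(log_probs) / len(log_probs)  # ≈ 0.83
ppl = math.exp(cross_entropy)                     # ≈ 2.30
print(f"cross-entropy = {cross_entropy:.2f}, perplexity = {ppl:.2f}")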

Variants and history

Perplexity has its roots in information theory and was adopted for evaluating language models in speech-recognition research; it became the standard evaluation metric for n-gram and, later, neural language models. Recent work questions whether perplexity alone captures downstream task performance; other metrics (BLEU, ROUGE, task-specific accuracy) are often more relevant. Perplexity remains useful for comparing model architectures and training approaches on the same benchmark.

When to use it

Use perplexity when:

  • Comparing language models on the same test set (with the same tokenization)
  • Evaluating language model pre-training
  • Performing intrinsic evaluation independent of downstream tasks
  • Tracking training progress via validation perplexity (see the sketch after this list)
  • Comparing to published benchmarks (WikiText, Penn Treebank)
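For the training-progress use case, validation perplexity is simply the exponential of the average per-token cross-entropy over the validation set. A sketch follows; the model hook `token_log_prob` is hypothetical, and most frameworks instead report a mean cross-entropy loss that you can exponentiate directly:

import math

def validation_perplexity(model, sequences):
    """Corpus-level perplexity: exponentiated average negative
    log-probability over all tokens in all validation sequences."""
    total_neg_log_prob = 0.0
    total_tokens = 0
    for tokens in sequences:
        for i, token in enumerate(tokens):
            # token_log_prob is a hypothetical hook returning
            # log P(token | tokens[:i]) under the model.
            total_neg_log_prob -= model.token_log_prob(token, tokens[:i])
            total_tokens += 1
    return math.exp(total_neg_log_prob / total_tokens)

# If the framework already reports mean cross-entropy (natural log) on the
# validation set, perplexity is simply math.exp(mean_val_loss).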

Perplexity is useful but limited: a model with low perplexity may still perform poorly on downstream tasks if objectives diverge, and perplexity values are only directly comparable between models that share the same tokenization and test data.

See also