Perplexity
What it is
Perplexity (PPL) is an evaluation metric for language models that measures how surprised the model is by a test set. Mathematically, it is the exponential of the average negative log-probability (the cross-entropy) of the test tokens. Lower perplexity means better predictions: a model that assigns high probability to the observed tokens has low perplexity.
[illustrate: Language model probability distributions over two test sequences; calculating log-probabilities and averaging; exponentiation to perplexity]
How it works
For a test sequence of N tokens:
Cross-entropy = -(1/N) × Σ log P(t_i | context_i)
Perplexity = exp(cross-entropy) = exp(-(1/N) × Σ log P(t_i | context_i))
Here log is the natural logarithm; any base works as long as the exponentiation uses the same base.
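The two formulas translate directly into a few lines of Python. The sketch below takes the model's probability for each observed token; the function name and inputs are illustrative, not from the source:

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability of each observed token.

    Cross-entropy is the negative mean log-probability of the tokens;
    perplexity is its exponential.
    """
    n = len(token_probs)
    cross_entropy = -sum(math.log(p) for p in token_probs) / n
    return math.exp(cross_entropy)

# A model that assigns probability 1.0 to every token is never surprised:
print(perplexity([1.0, 1.0, 1.0]))  # → 1.0
```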
Interpretation: a perplexity of k means the model is "equally confused" as a model that predicts uniformly among k tokens. For example, perplexity 50 means the model is, on average, as uncertain as a random choice among 50 equally likely options.
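The "uniform among k tokens" reading can be checked directly from the definition: if every observed token receives probability 1/k, the perplexity is exactly k. A minimal check (k = 50 and the sequence length are arbitrary choices):

```python
import math

# Uniform prediction over k tokens: every observed token gets probability 1/k.
k = 50
probs = [1.0 / k] * 10  # any sequence length gives the same result
cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
print(math.exp(cross_entropy))  # ≈ 50.0
```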
Example
# Test sequence: "the quick brown fox"
# Language model probabilities:
P("the" | START) = 0.3 → log P = -1.20
P("quick" | "the") = 0.4 → log P = -0.92
P("brown" | "the quick") = 0.5 → log P = -0.69
P("fox" | "the quick brown") = 0.6 → log P = -0.51
Cross-entropy = -(1/4) × (-1.20 - 0.92 - 0.69 - 0.51) = 0.83
Perplexity = exp(0.83) ≈ 2.30
# Model predicts equally well among ~2.3 tokens on average
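The worked example can be reproduced numerically, using the four probabilities from the example:

```python
import math

probs = [0.3, 0.4, 0.5, 0.6]  # P for "the", "quick", "brown", "fox"
log_probs = [math.log(p) for p in probs]
cross_entropy = -sum(log_probs) / len(probs)
ppl = math.exp(cross_entropy)
print(round(cross_entropy, 2), round(ppl, 2))  # → 0.83 2.3
```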
Variants and history
Perplexity was introduced in the speech-recognition community in the 1970s (Jelinek and colleagues), building on information-theoretic cross-entropy, and became the standard evaluation metric for n-gram and, later, neural language models. Recent work questions whether perplexity alone predicts downstream task performance; other metrics (BLEU, ROUGE, task-specific accuracy) are often more relevant. Perplexity remains useful for comparing model architectures and training approaches on the same benchmark.
When to use it
Use perplexity when:
- Comparing language models on the same test set
- Evaluating language model pre-training
- Intrinsic evaluation independent of downstream tasks
- Tracking training progress (validation perplexity)
- Comparing to published benchmarks (WikiText, Penn Treebank)
Perplexity is useful but limited: a model with low perplexity may still perform poorly on downstream tasks if the training objective and the task diverge, and perplexities are only directly comparable across models that share a tokenizer and vocabulary.