BLEU Score
What it is
BLEU (Bilingual Evaluation Understudy) is an automatic metric for evaluating machine translation quality based on n-gram overlap with reference translations. It compares system output against one or more human reference translations, measuring the proportion of matching n-grams (unigrams, bigrams, trigrams, and 4-grams). BLEU is fast and language-independent, and it became the standard MT evaluation metric, though its limitations are well recognized.
[illustrate: System translation compared to reference translations; n-gram matches highlighted; BLEU calculation]
How it works
N-gram precision:
- For each n-gram size n (1, 2, 3, 4):
- Compute: p_n = (# matched n-grams, clipped by reference counts) / (# total n-grams in system output)

Brevity penalty:
- Penalizes system output that is shorter than the reference
- BP = exp(1 - reference_length / output_length) if output_length < reference_length, else BP = 1

BLEU:
BLEU = BP × exp(Σ w_n × log(p_n)), where w_n is the weight for n-gram order n (typically 0.25 each)

Score range: 0–100 (or equivalently 0–1)
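The steps above can be sketched as a minimal sentence-level implementation. This is a from-scratch sketch, not sacreBLEU-exact; the function names and the simple whitespace tokenization are illustrative choices:

```python
# Minimal sentence-level BLEU sketch: clipped n-gram precision,
# brevity penalty, and geometric mean with uniform weights.
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(system, reference, max_n=4):
    """BLEU for one system sentence against a single reference."""
    sys_tokens, ref_tokens = system.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        sys_counts = Counter(ngrams(sys_tokens, n))
        ref_counts = Counter(ngrams(ref_tokens, n))
        # Clip each system n-gram count by its count in the reference
        matched = sum(min(c, ref_counts[g]) for g, c in sys_counts.items())
        total = max(sum(sys_counts.values()), 1)
        if matched == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: only applied when output is shorter than reference
    if len(sys_tokens) >= len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(sys_tokens))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Note the early return on a zero precision: real toolkits apply smoothing at the sentence level precisely because a single unmatched n-gram order would otherwise zero the whole score.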
Example
Reference: "The cat is on the mat"
System: "The cat is on the mat"
Perfect match: BLEU = 100
Reference: "The cat is on the mat"
System: "A cat is on the mat"
Modified (clipped) n-gram precisions:
1-gram: 5/6 ("A" has no match)
2-gram: 4/5 ("A cat" has no match)
3-gram: 3/4
4-gram: 2/3
BLEU combines these with a geometric mean, not an arithmetic average:
BLEU = (5/6 × 4/5 × 3/4 × 2/3)^(1/4) = (1/3)^(1/4) ≈ 0.76 (76.0)
Both sentences are six tokens long, so the brevity penalty is 1; a shorter system output would be scaled down further by BP.
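For this sentence pair the clipped precisions are 5/6, 4/5, 3/4, and 2/3, and with equal weights and matching lengths (BP = 1) the score reduces to a plain geometric mean. A quick arithmetic check:

```python
# Clipped precisions for "A cat is on the mat" vs "The cat is on the mat"
precisions = [5 / 6, 4 / 5, 3 / 4, 2 / 3]

# Geometric mean (BP = 1 because system and reference lengths match)
product = 1.0
for p in precisions:
    product *= p  # product works out to exactly 1/3
bleu_score = product ** (1 / len(precisions))
print(round(bleu_score, 4))  # 0.7598, i.e. BLEU ≈ 76.0
```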
Variants and history
BLEU was published by Papineni et al. (2002) and revolutionized MT evaluation by making it cheap and repeatable. Its limitations were recognized early:
- Doesn’t account for synonymy
- Single reference may not capture valid alternatives
- Character-level variations (capitalization) penalized
- Correlates with human judgment ~0.4 (moderate)
Variants and successors: SacreBLEU (reproducible BLEU with standardized tokenization), chrF (character-level), TER (edit distance), METEOR (synonym-aware). Neural metrics (BERTScore, COMET) correlate better with human judgment but are less transparent.
When to use it
Use BLEU when:
- You need a baseline MT evaluation (it remains the industry standard)
- You are comparing systems on the same benchmark with the same references
- The language pair is well studied, so correlations with human judgment are established
- You need a transparent, fast metric
- You want historical comparison with published systems
Given its limitations, use BLEU together with human evaluation or a neural metric (e.g., BERTScore) for robust assessment; it is not recommended as the sole metric for modern systems.