BLEU Score

What it is

BLEU (Bilingual Evaluation Understudy) is an automatic metric for evaluating machine translation quality based on n-gram overlap with reference translations. It compares system output against one or more human reference translations, measuring the proportion of matching n-grams (unigrams, bigrams, trigrams, and 4-grams). BLEU is fast and language-independent, and it became the standard MT evaluation metric, though its limitations are well recognized.

[illustrate: System translation compared to reference translations; n-gram matches highlighted; BLEU calculation]

How it works

  1. Modified n-gram precision:

    • For each n-gram size (1, 2, 3, 4), count the system n-grams that also occur in a reference, clipping each n-gram's count at its maximum count in any single reference
    • Compute: p_n = (# clipped matched n-grams) / (# total n-grams in system output)
  2. Brevity penalty:

    • Penalizes system output that is shorter than the references
    • BP = exp(1 - reference_length / output_length) if the output is shorter than the reference; BP = 1 otherwise
  3. BLEU:

    BLEU = BP × exp(Σ w_n × log(p_n))

    Where w_n is the weight for each n-gram order (0.25 each in standard BLEU-4), so the exponential term is the geometric mean of the p_n (see the sketch after this list)

  4. Score range: 0–100 (or, equivalently, 0–1)
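
A minimal Python sketch of steps 1–3 (clipped n-gram precision, brevity penalty, geometric mean), assuming whitespace-tokenized input; the function names are illustrative only, and real toolkits add tokenization and smoothing on top:

    import math
    from collections import Counter

    def ngrams(tokens, n):
        # all contiguous n-grams of a token list
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, references, max_n=4):
        log_precisions = []
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(candidate, n))
            # clip each n-gram's count at its maximum count in any single reference
            max_ref_counts = Counter()
            for ref in references:
                for gram, count in Counter(ngrams(ref, n)).items():
                    max_ref_counts[gram] = max(max_ref_counts[gram], count)
            clipped = sum(min(count, max_ref_counts[gram])
                          for gram, count in cand_counts.items())
            if clipped == 0:
                return 0.0  # any zero precision drives the geometric mean to zero
            log_precisions.append(math.log(clipped / sum(cand_counts.values())))
        # brevity penalty against the closest reference length
        c = len(candidate)
        r = min((len(ref) for ref in references), key=lambda rl: (abs(rl - c), rl))
        bp = 1.0 if c > r else math.exp(1 - r / c)
        # uniform weights w_n = 1/max_n make the sum a geometric mean of the precisions
        return bp * math.exp(sum((1.0 / max_n) * lp for lp in log_precisions))

    reference = "The cat is on the mat".split()
    system = "A cat is on the mat".split()
    print(round(bleu(system, [reference]), 3))  # ≈ 0.76 for the example below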

Example

Reference: "The cat is on the mat"
System:    "The cat is on the mat"
Perfect match: BLEU = 100

Reference: "The cat is on the mat"
System:    "A cat is on the mat"
Differences:
  1-gram: 5/6 match
  2-gram: 4/5 match
  3-gram: 3/4 match
  4-gram: 2/3 match

BLEU ≈ exp(0.25×log(5/6) + 0.25×log(4/5) + 0.25×log(3/4) + 0.25×log(2/3))
     = (5/6 × 4/5 × 3/4 × 2/3)^(1/4)
     = (1/3)^(1/4)
     ≈ 0.76 (76)

The brevity penalty is 1 here, since system and reference have the same length (6 tokens); a shorter system output would scale the score down further.
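
The same pair can be cross-checked against NLTK's sentence-level BLEU (a hedged usage sketch: it assumes the nltk package is installed and uses the same uniform 0.25 weights as above):

    from nltk.translate.bleu_score import sentence_bleu

    reference = "The cat is on the mat".split()
    system = "A cat is on the mat".split()
    # sentence_bleu takes a list of tokenized references and a tokenized hypothesis
    score = sentence_bleu([reference], system, weights=(0.25, 0.25, 0.25, 0.25))
    print(round(score, 3))  # ≈ 0.76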

Variants and history

BLEU was published by Papineni et al. (2002) and revolutionized MT evaluation. Well-recognized limitations include:

  • Doesn’t account for synonyms or paraphrases
  • A single reference may not capture all valid alternatives
  • Surface variations (e.g. capitalization or inflection) count as full mismatches
  • Correlation with human judgment is only moderate (around 0.4)

Variants: SacreBLEU (reproducible BLEU), chrF (character-level), TER (edit distance), METEOR (synonym-aware). Neural metrics such as BERTScore and COMET correlate better with human judgment but are less transparent.
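
For reproducible numbers, SacreBLEU fixes the tokenization and reports a signature alongside the score. A hedged usage sketch of its Python API (assumes the sacrebleu package is installed; the exact value depends on its version and default tokenizer):

    import sacrebleu

    hypotheses = ["A cat is on the mat"]
    references = [["The cat is on the mat"]]  # one inner list per reference set
    result = sacrebleu.corpus_bleu(hypotheses, references)
    print(result.score)  # corpus-level BLEU on the 0–100 scale

On the command line, the typical pattern is cat output.txt | sacrebleu reference.txt, which prints the score together with its configuration signature.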

When to use it

Use BLEU when:

  • Baseline MT evaluation (the industry standard)
  • Comparing systems on the same benchmark
  • The language pair is well studied (correlations with human judgment are established)
  • A transparent, fast metric is needed
  • Historical comparison with published systems

Because of these limitations, use BLEU together with human evaluation or neural metrics (e.g. BERTScore) for a robust assessment; it is not recommended as the sole metric for modern systems.

See also