BLEU Score

What it is

BLEU (Bilingual Evaluation Understudy) is an automatic metric for evaluating machine translation quality based on n-gram overlap with reference translations. It compares system output against one or more human reference translations, measuring the proportion of matching n-grams (unigrams, bigrams, trigrams, and 4-grams). BLEU is fast and language-independent, and it became the standard MT evaluation metric, though its limitations are well recognized.

[illustrate: System translation compared to reference translations; n-gram matches highlighted; BLEU calculation]

How it works

  1. Modified n-gram precision:

    • For each n-gram size (1, 2, 3, 4), count the system n-grams that also occur in a reference, clipping each n-gram's count at its maximum count in any single reference
    • Compute: p_n = (# clipped matched n-grams) / (# total n-grams in system output)
  2. Brevity penalty:

    • Penalizes system output that is shorter than the references
    • BP = exp(1 - reference_length / output_length) if the output is shorter than the reference; BP = 1 otherwise
  3. BLEU:

    BLEU = BP × exp(Σ w_n × log(p_n))

    Where w_n is the weight for each n-gram order (0.25 each in standard BLEU-4), so the exponential term is the geometric mean of the p_n (see the sketch after this list)

  4. Score range: 0–100 (or, equivalently, 0–1)
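
A minimal Python sketch of steps 1–3 (clipped n-gram precision, brevity penalty, geometric mean), assuming whitespace-tokenized input; the function names are illustrative only, and real toolkits add tokenization and smoothing on top:

    import math
    from collections import Counter

    def ngrams(tokens, n):
        # all contiguous n-grams of a token list
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, references, max_n=4):
        log_precisions = []
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(candidate, n))
            # clip each n-gram's count at its maximum count in any single reference
            max_ref_counts = Counter()
            for ref in references:
                for gram, count in Counter(ngrams(ref, n)).items():
                    max_ref_counts[gram] = max(max_ref_counts[gram], count)
            clipped = sum(min(count, max_ref_counts[gram])
                          for gram, count in cand_counts.items())
            if clipped == 0:
                return 0.0  # any zero precision drives the geometric mean to zero
            log_precisions.append(math.log(clipped / sum(cand_counts.values())))
        # brevity penalty against the closest reference length
        c = len(candidate)
        r = min((len(ref) for ref in references), key=lambda rl: (abs(rl - c), rl))
        bp = 1.0 if c > r else math.exp(1 - r / c)
        # uniform weights w_n = 1/max_n make the sum a geometric mean of the precisions
        return bp * math.exp(sum((1.0 / max_n) * lp for lp in log_precisions))

    reference = "The cat is on the mat".split()
    system = "A cat is on the mat".split()
    print(round(bleu(system, [reference]), 3))  # ≈ 0.76 for the example below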

Example

Reference: "The cat is on the mat"
System:    "The cat is on the mat"
Perfect match: BLEU = 100

Reference: "The cat is on the mat"
System:    "A cat is on the mat"
Differences:
  1-gram: 5/6 match
  2-gram: 4/5 match
  3-gram: 3/4 match
  4-gram: 2/3 match

BLEU ≈ exp(0.25×log(5/6) + 0.25×log(4/5) + 0.25×log(3/4) + 0.25×log(2/3))
     = (5/6 × 4/5 × 3/4 × 2/3)^(1/4)
     = (1/3)^(1/4)
     ≈ 0.76 (76)

The brevity penalty is 1 here, since system and reference have the same length (6 tokens); a shorter system output would scale the score down further.
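
The same pair can be cross-checked against NLTK's sentence-level BLEU (a hedged usage sketch: it assumes the nltk package is installed and uses the same uniform 0.25 weights as above):

    from nltk.translate.bleu_score import sentence_bleu

    reference = "The cat is on the mat".split()
    system = "A cat is on the mat".split()
    # sentence_bleu takes a list of tokenized references and a tokenized hypothesis
    score = sentence_bleu([reference], system, weights=(0.25, 0.25, 0.25, 0.25))
    print(round(score, 3))  # ≈ 0.76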

Variants and history

BLEU was published by Papineni et al. (2002) and revolutionized MT evaluation. Well-recognized limitations include:

  • Doesn’t account for synonyms or paraphrases
  • A single reference may not capture all valid alternatives
  • Surface variations (e.g. capitalization or inflection) count as full mismatches
  • Correlation with human judgment is only moderate (around 0.4)

Variants: SacreBLEU (reproducible BLEU), chrF (character-level), TER (edit distance), METEOR (synonym-aware). Neural metrics such as BERTScore and COMET correlate better with human judgment but are less transparent.
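
For reproducible numbers, SacreBLEU fixes the tokenization and reports a signature alongside the score. A hedged usage sketch of its Python API (assumes the sacrebleu package is installed; the exact value depends on its version and default tokenizer):

    import sacrebleu

    hypotheses = ["A cat is on the mat"]
    references = [["The cat is on the mat"]]  # one inner list per reference set
    result = sacrebleu.corpus_bleu(hypotheses, references)
    print(result.score)  # corpus-level BLEU on the 0–100 scale

On the command line, the typical pattern is cat output.txt | sacrebleu reference.txt, which prints the score together with its configuration signature.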

When to use it

Use BLEU when:

  • Baseline MT evaluation (the industry standard)
  • Comparing systems on the same benchmark
  • The language pair is well studied (correlations with human judgment are established)
  • A transparent, fast metric is needed
  • Historical comparison with published systems

Because of these limitations, use BLEU together with human evaluation or neural metrics (e.g. BERTScore) for a robust assessment; it is not recommended as the sole metric for modern systems.

See also