ROUGE

What it is

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of automatic metrics for evaluating extractive and abstractive summarization by measuring n-gram overlap between a system summary and one or more reference summaries. Unlike BLEU, which is precision-oriented, ROUGE emphasizes recall. Main variants: ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence), ROUGE-W (weighted LCS).

[illustrate: System summary compared to reference summaries; n-gram and LCS matches highlighted]

How it works

ROUGE-N:

  • Count n-gram matches between system and reference
  • Recall: (# matched n-grams) / (# n-grams in reference)
  • Typically report ROUGE-1 (unigrams), ROUGE-2 (bigrams)
  • Emphasizes recall over precision (precision and F-measure variants are also reported)
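The recall computation above can be sketched in a few lines of Python. This is a minimal illustration, assuming lowercased whitespace tokenization and clipped (per-count) matching; `rouge_n_recall` and `ngrams` are illustrative names, and real toolkits add stemming, sentence splitting, and multi-reference handling:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(system, reference, n):
    """ROUGE-N recall: clipped n-gram matches / n-grams in the reference."""
    sys_counts = Counter(ngrams(system.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    matches = sum((sys_counts & ref_counts).values())  # clipped overlap
    total = sum(ref_counts.values())
    return matches / total if total else 0.0
```

Clipping via `Counter` intersection ensures a word repeated in the system summary cannot match more times than it appears in the reference.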

ROUGE-L:

  • Longest Common Subsequence (LCS) between system and reference
  • Captures word order preservation
  • Does not require consecutive matches, so it is more tolerant of rewording than ROUGE-N
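A minimal sketch of the LCS-based score, assuming lowercased whitespace tokenization. Reported ROUGE-L is an F-measure combining LCS recall and precision; `beta=1.0` is an illustrative default here (Lin (2004) weights recall heavily):

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(system, reference, beta=1.0):
    """ROUGE-L F-measure from LCS-based recall and precision."""
    sys_t, ref_t = system.lower().split(), reference.lower().split()
    lcs = lcs_length(sys_t, ref_t)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref_t)
    precision = lcs / len(sys_t)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```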

ROUGE-W:

  • Weighted LCS giving credit to consecutive matches
  • Encourages fluency
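The weighted-LCS idea can be sketched as follows. This is a reading of Lin (2004)'s WLCS recurrence, where a run of k consecutive matches scores f(k) = k**alpha instead of k; `rouge_w_recall`, alpha=2, and lowercased whitespace tokenization are illustrative choices:

```python
def rouge_w_recall(system, reference, alpha=2.0):
    """Weighted-LCS recall: consecutive match runs of length k score
    f(k) = k**alpha, rewarding contiguous (fluent) matches."""
    a, b = system.lower().split(), reference.lower().split()
    f = lambda k: k ** alpha
    c = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]  # WLCS score table
    w = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]    # current run length
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)  # extend the run
                w[i][j] = k + 1
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], w[i][j] = c[i - 1][j], 0            # run broken
            else:
                c[i][j], w[i][j] = c[i][j - 1], 0
    # Normalize with f^-1 so a perfect match scores 1.0
    return (c[-1][-1] / f(len(b))) ** (1 / alpha)
```

Because f is convex for alpha > 1, one run of two consecutive matches outscores two isolated matches, which is what "encourages fluency" means above.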

Formula (ROUGE-N):

ROUGE = (# n-gram matches) / (# n-grams in reference)

Example

Reference: "The quick brown fox jumps over the lazy dog"
System:    "A quick brown fox jumps over a lazy dog"

ROUGE-1 (unigrams):
Matched: {quick, brown, fox, jumps, over, lazy, dog} = 7
Reference total: 9
ROUGE-1 = 7/9 = 0.78

ROUGE-2 (bigrams):
Matched: {quick brown, brown fox, fox jumps, jumps over, lazy dog} = 5
(note "over the" vs. "over a" differ, so those bigrams do not match)
Reference bigrams: 8
ROUGE-2 = 5/8 = 0.625

ROUGE-L (LCS):
LCS: "quick brown fox jumps over lazy dog" (length 7)
Reference length: 9
ROUGE-L = 7/9 ≈ 0.78 (as recall; the reported score is usually an F-measure of LCS recall and precision, which here are both 7/9)
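The unigram and bigram arithmetic above can be checked directly with a short, self-contained snippet (lowercased whitespace tokenization, clipped counts; with clipping only 5 of the 8 reference bigrams match, since "over the" and "over a" differ):

```python
from collections import Counter

ref = "the quick brown fox jumps over the lazy dog".split()
hyp = "a quick brown fox jumps over a lazy dog".split()

def counts(tokens, n):
    """Counter of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

rouge1 = sum((counts(hyp, 1) & counts(ref, 1)).values()) / sum(counts(ref, 1).values())
rouge2 = sum((counts(hyp, 2) & counts(ref, 2)).values()) / sum(counts(ref, 2).values())
print(round(rouge1, 2), round(rouge2, 3))  # 0.78 0.625
```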

Variants and history

ROUGE was introduced by Lin (2004) for summarization evaluation and became the standard metric at the summarization benchmarks DUC and TAC. ROUGE-S (skip-bigram) tolerates gaps between matched word pairs; ROUGE-SU additionally counts unigrams. Multi-reference ROUGE aggregates scores across multiple valid summaries. Neural metrics (BERTScore, BLEURT) often correlate better with human judgment, but ROUGE remains widely adopted despite recognized limitations (synonymy, paraphrasing).

When to use it

Use ROUGE when:

  • Evaluating summarization systems
  • Benchmark system comparisons (DUC, TAC)
  • Need fast, interpretable automatic metric
  • Reference summaries available
  • Recall-focused evaluation important

ROUGE limitations: doesn’t account for synonymy, paraphrasing, or semantic equivalence. Combine with human evaluation or BERTScore for robust assessment.

See also