ROUGE
What it is
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is an automatic metric for evaluating abstractive and extractive summarization by measuring n-gram overlap between system output (summary) and reference summaries. Unlike BLEU (precision-focused), ROUGE emphasizes recall. Variants: ROUGE-N (n-gram), ROUGE-L (longest common subsequence), ROUGE-W (weighted LCS).
[illustrate: System summary compared to reference summaries; n-gram and LCS matches highlighted]
How it works
ROUGE-N:
- Count n-gram matches between system and reference
- Recall: (# matched n-grams) / (# n-grams in reference)
- Typically report ROUGE-1 (unigrams), ROUGE-2 (bigrams)
- Recall-oriented by design, though precision and F1 are often reported alongside
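The ROUGE-N recall computation can be sketched in a few lines. This is a minimal illustration assuming whitespace tokenization and lowercasing, not the official implementation (which adds stemming, sentence splitting, and multi-reference handling):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(system, reference, n):
    """Clipped n-gram matches divided by the reference n-gram count."""
    sys_counts = Counter(ngrams(system.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    # Clip each match at the reference count so repeated system n-grams
    # cannot be credited more times than they appear in the reference.
    matches = sum(min(count, sys_counts[g]) for g, count in ref_counts.items())
    total = sum(ref_counts.values())
    return matches / total if total else 0.0
```

For published results, prefer an established implementation such as Google's `rouge-score` Python package, which handles stemming and score aggregation.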
ROUGE-L:
- Longest Common Subsequence (LCS) between system and reference
- Captures word order preservation
- More tolerant of light paraphrasing than fixed n-grams, since matched words need not be contiguous
ROUGE-W:
- Weighted LCS giving credit to consecutive matches
- Encourages fluency
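The weighted LCS can be sketched with the weighting f(k) = k^alpha from Lin's formulation, so a consecutive run of length k contributes k^alpha rather than k. A minimal sketch; the exponent alpha = 2 here is an illustrative choice:

```python
def wlcs(a, b, alpha=2.0):
    """Weighted LCS: a consecutive run of k matches contributes k**alpha."""
    n, m = len(a), len(b)
    c = [[0.0] * (m + 1) for _ in range(n + 1)]  # weighted LCS score
    w = [[0] * (m + 1) for _ in range(n + 1)]    # length of the run ending here
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                k = w[i - 1][j - 1]
                # Incremental credit for extending a run from length k to k+1.
                c[i][j] = c[i - 1][j - 1] + (k + 1) ** alpha - k ** alpha
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
                w[i][j] = 0
    return c[n][m]

def rouge_w_recall(system, reference, alpha=2.0):
    """ROUGE-W recall: f_inverse(WLCS / f(reference length)), f(k) = k**alpha."""
    sys_t, ref_t = system.lower().split(), reference.lower().split()
    return (wlcs(sys_t, ref_t, alpha) / len(ref_t) ** alpha) ** (1 / alpha)
```

On the sentence pair from the example below, the two consecutive runs ("quick brown fox jumps over" and "lazy dog") give WLCS = 5^2 + 2^2 = 29 and recall = sqrt(29/81), roughly 0.60: lower than plain ROUGE-L because scattered matches earn less credit than one long run.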
Formula (ROUGE-N):
ROUGE-N = (# n-gram matches) / (# n-grams in reference)
Example
Reference: "The quick brown fox jumps over the lazy dog"
System: "A quick brown fox jumps over a lazy dog"
ROUGE-1 (unigrams):
Matched: {quick, brown, fox, jumps, over, lazy, dog} = 7
Reference total: 9
ROUGE-1 = 7/9 = 0.78
ROUGE-2 (bigrams):
Matched: {quick brown, brown fox, fox jumps, jumps over, lazy dog} = 5
Reference bigrams: 8
ROUGE-2 = 5/8 = 0.625
ROUGE-L (LCS):
LCS: "quick brown fox jumps over lazy dog" (length 7)
Reference length: 9
ROUGE-L recall = 7/9 ≈ 0.78; the system summary also has 9 words, so LCS precision and the reported F-measure are ≈ 0.78 as well
Variants and history
ROUGE was introduced by Lin (2004) for summarization evaluation and became the standard metric in summarization shared tasks (DUC, TAC). ROUGE-S and ROUGE-SU (skip-bigrams, the latter also counting unigrams) tolerate out-of-order words. Multi-reference ROUGE aggregates scores across multiple valid reference summaries. Neural metrics (BERTScore, BLEURT) correlate better with human judgment; ROUGE remains widely adopted, but its limitations (synonymy, paraphrasing) are well recognized.
When to use it
Use ROUGE when:
- Evaluating summarization systems
- Benchmark system comparisons (DUC, TAC)
- Need fast, interpretable automatic metric
- Reference summaries available
- Recall-focused evaluation important
ROUGE limitations: doesn’t account for synonymy, paraphrasing, or semantic equivalence. Combine with human evaluation or BERTScore for robust assessment.