BERTScore
What it is
BERTScore is an automatic evaluation metric that uses contextual word embeddings from BERT to compute the semantic similarity between a system output and a reference text. Unlike n-gram metrics (BLEU, ROUGE), BERTScore captures paraphrases, synonyms, and semantic equivalence. It shows stronger correlation with human judgment on translation, summarization, and generation tasks.
[illustrate: System and reference text encoded with BERT; token-wise semantic similarity matrix; greedy matching visualization]
How it works
- Encoding:
  - Encode the system output and reference(s) with BERT
  - Each token gets a contextual embedding
- Token-level matching:
  - For each system token, find the most similar reference token (greedy matching)
  - Similarity is the cosine similarity between the two embeddings
- Aggregation:
  - Precision: average similarity of system tokens to their best reference matches
  - Recall: average similarity of reference tokens to their best system matches
  - F1: harmonic mean of precision and recall
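A minimal sketch of the encoding and greedy-matching steps, assuming the Hugging Face transformers library with bert-base-uncased (an illustrative choice, not necessarily the model or pre-processing used by the official implementation):

```python
# Sketch: encode two texts with BERT and compute the token-wise cosine
# similarity matrix used for greedy matching. bert-base-uncased and dropping
# special tokens are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return one contextual embedding per token, with [CLS]/[SEP] dropped."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    return hidden[1:-1]                              # drop special tokens

sys_emb = embed("A feline is napping")               # system output
ref_emb = embed("The cat is sleeping")               # reference

# Cosine similarity matrix: rows = system tokens, columns = reference tokens
sys_n = torch.nn.functional.normalize(sys_emb, dim=-1)
ref_n = torch.nn.functional.normalize(ref_emb, dim=-1)
sim = sys_n @ ref_n.T                                # (|s|, |r|)

best_match_per_sys_token = sim.max(dim=1).values     # feeds precision
best_match_per_ref_token = sim.max(dim=0).values     # feeds recall
```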
Formulas (s = system tokens, r = reference tokens):
Precision P = (1/|s|) × Σ_i max_j cos(s_i, r_j)
Recall R = (1/|r|) × Σ_j max_i cos(s_i, r_j)
BERTScore (F1) = 2 × P × R / (P + R)
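A toy worked example of these formulas; the similarity values are invented for illustration, not real BERT similarities:

```python
# Toy numeric check of the precision/recall/F1 formulas above.
import numpy as np

# rows = system tokens, columns = reference tokens
sim = np.array([
    [0.95, 0.10, 0.05],   # system token 1
    [0.20, 0.90, 0.15],   # system token 2
])

precision = sim.max(axis=1).mean()   # best reference match per system token
recall = sim.max(axis=0).mean()      # best system match per reference token
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))   # 0.925 0.667 0.775
```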
Example
Reference: "The cat is sleeping"
System1: "The cat is sleeping" (exact match)
System2: "A feline is napping" (paraphrase)
With BLEU:
System1: perfect match → BLEU = 100
System2: minimal n-gram overlap → BLEU ≈ 25
With BERTScore:
System1: all tokens match → BERTScore ≈ 1.0
System2: "feline"≈"cat", "napping"≈"sleeping" → BERTScore ≈ 0.95
BERTScore captures semantic equivalence; BLEU penalizes synonymy.
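The comparison above can be approximated with the bert_score package (https://github.com/Tiiiger/bert_score); the exact numbers depend on the underlying model and version, so the ≈0.95 paraphrase score is indicative only:

```python
# Sketch using the bert_score package (pip install bert-score); treat the
# numbers quoted above as indicative, not exact.
from bert_score import score

candidates = ["The cat is sleeping", "A feline is napping"]
references = ["The cat is sleeping", "The cat is sleeping"]

P, R, F1 = score(candidates, references, lang="en")
for cand, f1 in zip(candidates, F1.tolist()):
    print(f"{cand!r}: BERTScore F1 = {f1:.3f}")
```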
Variants and history
BERTScore was introduced by Zhang et al. (2019). Variants include:
- Base BERT vs. SciBERT (domain-specific encoders)
- Multilingual BERT for cross-lingual evaluation
- mBERTScore for multiple languages
- BERTScore with idf weighting to down-weight common words (see the sketch below)
BERTScore has been shown to correlate better with human judgment than BLEU and ROUGE. Limitations: it requires a pretrained BERT model and is not fully interpretable; it is slower than n-gram metrics; and scores depend on the choice of model and language.
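As noted in the list above, the bert_score package exposes idf weighting through an idf argument; a minimal sketch, with the idf statistics built from the supplied references:

```python
# Sketch: idf weighting down-weights common words such as "the" and "is".
from bert_score import score

candidates = ["A feline is napping"]
references = ["The cat is sleeping"]

P, R, F1 = score(candidates, references, lang="en", idf=True)
print(F1.item())
```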
When to use it
Use BERTScore when:
- Semantic equivalence matters (synonymy, paraphrasing)
- Evaluating generation tasks (translation, summarization, QA)
- Stronger correlation with human judgment than BLEU/ROUGE is needed
- References may be paraphrases (not exact matches)
- A modern, embedding-based evaluation metric is expected
BERTScore is slower than BLEU but captures meaning better. Combine it with human evaluation for critical applications; it works well for automatic benchmarking.