BERTScore
What it is
BERTScore is an automatic evaluation metric that uses contextual word embeddings from BERT to compute the semantic similarity between a system output and a reference text. Unlike n-gram metrics (BLEU, ROUGE), BERTScore captures paraphrases, synonyms, and semantic equivalence. It shows stronger correlation with human judgment on translation, summarization, and generation tasks.
[illustrate: System and reference text encoded with BERT; token-wise semantic similarity matrix; greedy matching visualization]
How it works
- Encoding:
  - Encode the system output and reference(s) with BERT
  - Each token gets a contextual embedding
- Token-level matching:
  - For each system token, find the most similar reference token (greedy matching)
  - Similarity is the cosine similarity between the two embeddings
- Aggregation:
  - Precision: average similarity of system tokens to their best reference matches
  - Recall: average similarity of reference tokens to their best system matches
  - F1: harmonic mean of precision and recall
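A minimal sketch of the encoding and greedy-matching steps, assuming the Hugging Face transformers library with bert-base-uncased (an illustrative choice, not necessarily the model or pre-processing used by the official implementation):

```python
# Sketch: encode two texts with BERT and compute the token-wise cosine
# similarity matrix used for greedy matching. bert-base-uncased and dropping
# special tokens are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return one contextual embedding per token, with [CLS]/[SEP] dropped."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    return hidden[1:-1]                              # drop special tokens

sys_emb = embed("A feline is napping")               # system output
ref_emb = embed("The cat is sleeping")               # reference

# Cosine similarity matrix: rows = system tokens, columns = reference tokens
sys_n = torch.nn.functional.normalize(sys_emb, dim=-1)
ref_n = torch.nn.functional.normalize(ref_emb, dim=-1)
sim = sys_n @ ref_n.T                                # (|s|, |r|)

best_match_per_sys_token = sim.max(dim=1).values     # feeds precision
best_match_per_ref_token = sim.max(dim=0).values     # feeds recall
```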
Formulas (s = system tokens, r = reference tokens):
Precision P = (1/|s|) × Σ_i max_j cos(s_i, r_j)
Recall R = (1/|r|) × Σ_j max_i cos(s_i, r_j)
BERTScore (F1) = 2 × P × R / (P + R)
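A toy worked example of these formulas; the similarity values are invented for illustration, not real BERT similarities:

```python
# Toy numeric check of the precision/recall/F1 formulas above.
import numpy as np

# rows = system tokens, columns = reference tokens
sim = np.array([
    [0.95, 0.10, 0.05],   # system token 1
    [0.20, 0.90, 0.15],   # system token 2
])

precision = sim.max(axis=1).mean()   # best reference match per system token
recall = sim.max(axis=0).mean()      # best system match per reference token
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))   # 0.925 0.667 0.775
```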
Example
Reference: "The cat is sleeping"
System1: "The cat is sleeping" (exact match)
System2: "A feline is napping" (paraphrase)
With BLEU:
System1: perfect match → BLEU = 100
System2: minimal n-gram overlap → BLEU ≈ 25
With BERTScore:
System1: all tokens match → BERTScore ≈ 1.0
System2: "feline"≈"cat", "napping"≈"sleeping" → BERTScore ≈ 0.95
BERTScore captures semantic equivalence; BLEU penalizes synonymy.
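The comparison above can be approximated with the bert_score package (https://github.com/Tiiiger/bert_score); the exact numbers depend on the underlying model and version, so the ≈0.95 paraphrase score is indicative only:

```python
# Sketch using the bert_score package (pip install bert-score); treat the
# numbers quoted above as indicative, not exact.
from bert_score import score

candidates = ["The cat is sleeping", "A feline is napping"]
references = ["The cat is sleeping", "The cat is sleeping"]

P, R, F1 = score(candidates, references, lang="en")
for cand, f1 in zip(candidates, F1.tolist()):
    print(f"{cand!r}: BERTScore F1 = {f1:.3f}")
```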
Variants and history
BERTScore was introduced by Zhang et al. (2019). Variants include:
- Base BERT vs. SciBERT (domain-specific encoders)
- Multilingual BERT for cross-lingual evaluation
- mBERTScore for multiple languages
- BERTScore with idf weighting to down-weight common words (see the sketch below)
BERTScore has been shown to correlate better with human judgment than BLEU and ROUGE. Limitations: it requires a pretrained BERT model and is not fully interpretable; it is slower than n-gram metrics; and scores depend on the choice of model and language.
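As noted in the list above, the bert_score package exposes idf weighting through an idf argument; a minimal sketch, with the idf statistics built from the supplied references:

```python
# Sketch: idf weighting down-weights common words such as "the" and "is".
from bert_score import score

candidates = ["A feline is napping"]
references = ["The cat is sleeping"]

P, R, F1 = score(candidates, references, lang="en", idf=True)
print(F1.item())
```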
When to use it
Use BERTScore when:
- Semantic equivalence matters (synonymy, paraphrasing)
- Evaluating generation tasks (translation, summarization, QA)
- Stronger correlation with human judgment than BLEU/ROUGE is needed
- References may be paraphrases (not exact matches)
- A modern, embedding-based evaluation metric is expected
BERTScore is slower than BLEU but captures meaning better. Combine it with human evaluation for critical applications; it works well for automatic benchmarking.