BERTScore

What it is

BERTScore is an automatic evaluation metric that uses contextual word embeddings from BERT to compute semantic similarity between system output and reference text. Unlike n-gram overlap metrics (BLEU, ROUGE), BERTScore can credit paraphrases, synonyms, and other forms of semantic equivalence, and it correlates more strongly with human judgment on translation, summarization, and generation tasks.

[illustrate: System and reference text encoded with BERT; token-wise semantic similarity matrix; greedy matching visualization]

How it works

  1. Encoding:

    • Encode system output and reference(s) with BERT
    • Each token gets contextual embedding
  2. Token-level matching:

    • Compute cosine similarity between every pair of system and reference token embeddings
    • For each system token, take its most similar reference token (greedy matching)
  3. Aggregation:

    • Precision: average similarity of system tokens to best reference matches
    • Recall: average similarity of reference tokens to best system matches
    • F1: harmonic mean of precision and recall
  4. Formulas:

    Precision = (1/|s|) × Σ_i max_j cos(s_i, r_j)
    Recall    = (1/|r|) × Σ_j max_i cos(s_i, r_j)
    BERTScore (F1) = 2 × (P × R) / (P + R)
    
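The greedy-matching computation above can be sketched in plain Python. The 2-D toy embeddings below are hypothetical stand-ins for real BERT vectors, used only to make the matrix arithmetic concrete:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def bertscore(sys_emb, ref_emb):
    """Greedy-matching precision, recall, and F1 from token embeddings."""
    # sim[i][j] = cos(s_i, r_j): rows are system tokens, columns are reference tokens
    sim = [[cosine(s, r) for r in ref_emb] for s in sys_emb]
    precision = sum(max(row) for row in sim) / len(sys_emb)    # row-wise max
    recall = sum(max(col) for col in zip(*sim)) / len(ref_emb) # column-wise max
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings (not real BERT outputs):
sys_emb = [[1.0, 0.0], [0.6, 0.8]]
ref_emb = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
p, r, f1 = bertscore(sys_emb, ref_emb)
```

Here every system token has a perfect match (P = 1.0), but one reference token is only partially covered, so recall and F1 drop below 1 — the asymmetry between precision and recall mirrors the formulas above.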

Example

Reference: "The cat is sleeping"
System1:   "The cat is sleeping"  (exact match)
System2:   "A feline is napping"  (paraphrase)

With BLEU:
System1: exact match → BLEU = 100
System2: almost no n-gram overlap (only "is" is shared) → BLEU near 0

With BERTScore:
System1: all tokens match → BERTScore ≈ 1.0
System2: "feline"≈"cat", "napping"≈"sleeping" → BERTScore ≈ 0.95

BERTScore captures semantic equivalence; BLEU penalizes synonymy.
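The surface-overlap side of this contrast is easy to reproduce. The sketch below uses unigram precision as a simplified stand-in for BLEU (full BLEU also uses higher-order n-grams and a brevity penalty):

```python
def ngram_precision(cand, ref, n=1):
    """Fraction of candidate n-grams that appear in the reference."""
    cand_t, ref_t = cand.lower().split(), ref.lower().split()
    cand_ngrams = [tuple(cand_t[i:i + n]) for i in range(len(cand_t) - n + 1)]
    ref_ngrams = [tuple(ref_t[i:i + n]) for i in range(len(ref_t) - n + 1)]
    matches = sum(1 for g in cand_ngrams if g in ref_ngrams)
    return matches / len(cand_ngrams)

ref = "The cat is sleeping"
ngram_precision("The cat is sleeping", ref)  # 1.0  (exact match)
ngram_precision("A feline is napping", ref)  # 0.25 (only "is" overlaps)
```

The paraphrase scores 0.25 despite meaning the same thing, which is exactly the failure mode BERTScore's embedding-based matching avoids.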

Variants and history

BERTScore was introduced by Zhang et al. (2019). Variants include:

  • Base BERT vs. domain-specific encoders (e.g., SciBERT)
  • Multilingual BERT for cross-lingual evaluation
  • mBERTScore for multiple languages
  • BERTScore with idf weighting to down-weight common words
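The idf-weighted variant replaces the plain average in precision/recall with an idf-weighted one, so that frequent words contribute less. A toy sketch (the actual metric computes idf over the reference corpus with smoothing, which is simplified away here):

```python
import math

def idf_weights(reference_corpus):
    """Toy idf over a list of token lists (real BERTScore adds smoothing)."""
    n = len(reference_corpus)
    df = {}
    for tokens in reference_corpus:
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / d) for w, d in df.items()}

def weighted_precision(best_sims, tokens, idf):
    """Idf-weighted average of each system token's best cosine match."""
    num = sum(idf.get(w, 0.0) * s for w, s in zip(tokens, best_sims))
    den = sum(idf.get(w, 0.0) for w in tokens)
    return num / den if den else 0.0

# Hypothetical example: "is" appears in every reference document, so its
# idf is 0 and its (possibly noisy) similarity no longer affects the score.
corpus = [["the", "cat", "is", "sleeping"], ["a", "dog", "is", "running"]]
idf = idf_weights(corpus)
score = weighted_precision([1.0, 1.0, 0.2, 1.0],
                           ["the", "cat", "is", "sleeping"], idf)
```

With the weak 0.2 match falling on the zero-idf token "is", the weighted precision stays at 1.0, whereas the unweighted average would have been dragged down.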

BERTScore has been shown to correlate better with human judgment than BLEU and ROUGE. Limitations: it requires a pretrained BERT model (and is not fully interpretable); it is slower than n-gram metrics; scores are model- and language-dependent.

When to use it

Use BERTScore when:

  • Semantic equivalence matters (synonymy, paraphrasing)
  • Evaluating generation tasks (translation, summarization, QA)
  • Stronger correlation with human judgment is needed than BLEU/ROUGE provide
  • References may be paraphrases (not exact matches)
  • Modern NLP evaluation expected

BERTScore is slower than BLEU but captures meaning better. Combine it with human evaluation for critical applications; it works well for automatic benchmarking.

See also