Knowledge Distillation for IR

What it is

Knowledge distillation for IR trains a fast bi-encoder retrieval model (student) to reproduce the relevance scores of a slow but accurate cross-encoder (teacher). The cross-encoder’s scores — computed offline for training pairs — serve as soft labels. The student learns a richer ranking signal than binary relevance provides, while maintaining bi-encoder inference speed at query time.

[illustrate: Cross-encoder teacher scores (query, passage) pairs offline → soft labels; bi-encoder student trained on soft labels → fast inference]
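A minimal sketch of the offline labeling step, assuming the teacher is exposed as a plain Python callable. `toy_teacher` is a hypothetical word-overlap stand-in, not a real cross-encoder; an actual cross-encoder's scoring function would take its place. `build_soft_labels` and the record fields are illustrative names that do not come from any library.

```python
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (query, positive passage, negative passage)


def toy_teacher(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: Jaccard word overlap in [0, 1]."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    if not q_words or not p_words:
        return 0.0
    return len(q_words & p_words) / len(q_words | p_words)


def build_soft_labels(
    triplets: List[Triplet],
    teacher: Callable[[str, str], float],
) -> List[Dict[str, object]]:
    """Score every (query, passage) pair once and keep the scores as soft labels."""
    records = []
    for query, positive, negative in triplets:
        pos_score = teacher(query, positive)
        neg_score = teacher(query, negative)
        records.append({
            "query": query,
            "positive": positive,
            "negative": negative,
            "teacher_pos": pos_score,
            "teacher_neg": neg_score,
            "teacher_margin": pos_score - neg_score,
        })
    return records


soft_labels = build_soft_labels(
    [("what is photosynthesis",
      "photosynthesis is the process by which plants make sugar",
      "cellular respiration produces ATP from glucose")],
    toy_teacher,
)
```

The student never calls the teacher at query time; it only sees the stored scores during training, which is why the expensive model can stay offline.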

How it works

  1. Teacher scoring (offline):

    • Run cross-encoder over all (query, passage) training pairs
    • Produces a continuous relevance score per pair
    • This is expensive but done once before training
  2. Distillation loss:

    • Margin MSE: minimize ((student_score(q, d+) - student_score(q, d-)) - (teacher_score(q, d+) - teacher_score(q, d-)))²
    • KL divergence: match the full score distribution over candidate passages
    • Rank consistency: ensure student ranking order matches teacher’s
  3. Combined training:

    • Hard labels (binary relevance) + soft labels (teacher scores)
    • Weighted combination: λ · distillation_loss + (1-λ) · hard_label_loss
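The loss terms above can be sketched in plain Python for a single training example. Several choices here are assumptions rather than something the text specifies: the KL direction (teacher relative to student), the `temperature` parameter, softmax cross-entropy as the hard-label loss, and the default `lam = 0.5`. The rank-consistency term is left out. A real training loop would compute the same quantities on batched tensors in an autodiff framework.

```python
import math
from typing import List


def margin_mse(student_pos: float, student_neg: float,
               teacher_pos: float, teacher_neg: float) -> float:
    """Squared difference between the student's and the teacher's score margins."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return (student_margin - teacher_margin) ** 2


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores over a candidate list into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation(student_scores: List[float], teacher_scores: List[float],
                    temperature: float = 1.0) -> float:
    """KL(teacher || student) over one query's candidate passages (assumed direction)."""
    p_teacher = softmax(teacher_scores, temperature)
    p_student = softmax(student_scores, temperature)
    return sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0.0)


def hard_label_loss(student_scores: List[float], positive_index: int) -> float:
    """Cross-entropy against the binary label: exactly one passage marked relevant."""
    return -math.log(softmax(student_scores)[positive_index])


def combined_loss(distillation: float, hard: float, lam: float = 0.5) -> float:
    """lam * distillation_loss + (1 - lam) * hard_label_loss."""
    return lam * distillation + (1.0 - lam) * hard
```

Margin MSE needs only one positive and one negative per query, while the KL term needs teacher scores for the whole candidate list, so the choice between them often follows from how many pairs the teacher scored.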

Example

Training triplet with teacher:
  query:    "What is photosynthesis?"
  positive: "Photosynthesis is the process by which plants..."
  negative: "Cellular respiration produces ATP from glucose..."

Teacher (cross-encoder) scores:
  score(q, positive) = 0.95
  score(q, negative) = 0.12

Student (bi-encoder) current scores:
  score(q, positive) = 0.71
  score(q, negative) = 0.45   ← student not yet distinguishing well

Margin MSE loss:
  target_margin = 0.95 - 0.12 = 0.83
  student_margin = 0.71 - 0.45 = 0.26
  loss = (0.26 - 0.83)² ≈ 0.32   ← large gradient, strong update
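To make the "strong update" concrete, the short sketch below recomputes the loss from the numbers above and derives its gradient with respect to the two student scores. The learning rate and the step applied directly to the scores are illustrative assumptions; in real training the gradient flows back into the bi-encoder's weights rather than the scores themselves.

```python
teacher_pos, teacher_neg = 0.95, 0.12
student_pos, student_neg = 0.71, 0.45

target_margin = teacher_pos - teacher_neg      # about 0.83
student_margin = student_pos - student_neg     # about 0.26
error = student_margin - target_margin         # about -0.57
loss = error ** 2                              # about 0.32

# Derivatives of (student_margin - target_margin)^2 w.r.t. each student score.
grad_pos = 2.0 * error     # about -1.14: gradient descent raises the positive's score
grad_neg = -2.0 * error    # about +1.14: gradient descent lowers the negative's score

# Illustrative step applied directly to the scores (assumed learning rate).
learning_rate = 0.1
new_pos = student_pos - learning_rate * grad_pos
new_neg = student_neg - learning_rate * grad_neg
new_loss = ((new_pos - new_neg) - target_margin) ** 2
```

The gradient pushes the two scores apart by equal amounts, and the loss after the step is smaller than before because the student's margin has moved toward the teacher's.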

Variants and history

Knowledge distillation for IR was popularized by TAS-B (2021) and SPLADE++ (2022). ColBERTv2 uses distillation to train its late-interaction model. GPL (Generative Pseudo Labeling) applies distillation to domain adaptation: it generates synthetic queries for target-domain passages, scores the resulting (query, passage) pairs with a cross-encoder, and distills those scores into a bi-encoder. The teacher need not be a cross-encoder; any stronger model (a larger bi-encoder, ColBERT) can serve as teacher.

When to use it

Use knowledge distillation when:

  • A cross-encoder (or other strong model) is available as a teacher
  • You want bi-encoder inference speed with cross-encoder-like quality
  • Binary relevance labels are too coarse, and graded teacher scores would give a richer training signal
  • Domain adaptation with GPL-style synthetic data is needed

See also