Knowledge Distillation for IR
What it is
Knowledge distillation for IR trains a fast bi-encoder retrieval model (the student) to reproduce the relevance scores of a slower but more accurate cross-encoder (the teacher). The cross-encoder’s scores, computed offline for the training pairs, serve as soft labels. The student thus learns a richer ranking signal than binary relevance labels provide, while keeping bi-encoder inference speed at query time.
[illustrate: Cross-encoder teacher scores (query, passage) pairs offline → soft labels; bi-encoder student trained on soft labels → fast inference]
How it works
- Teacher scoring (offline):
  - Run the cross-encoder over all (query, passage) training pairs
  - This produces a continuous relevance score per pair
  - Scoring is expensive, but it is done once, before training
- Distillation loss:
  - Margin MSE: minimize ((student_score(q, d+) - student_score(q, d-)) - (teacher_score(q, d+) - teacher_score(q, d-)))², the squared gap between the student’s and the teacher’s positive-negative margins
  - KL divergence: match the teacher’s full score distribution over candidate passages
  - Rank consistency: ensure the student’s ranking order matches the teacher’s
- Combined training:
  - Hard labels (binary relevance) + soft labels (teacher scores)
  - Weighted combination: λ · distillation_loss + (1-λ) · hard_label_loss, with 0 ≤ λ ≤ 1
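The KL-divergence and weighted-combination items above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not any particular library's implementation: the function names (`softmax`, `kl_distillation_loss`, `hard_label_loss`, `combined_loss`), the `temperature` parameter, the choice of softmax cross-entropy as the hard-label loss, and the sample scores are all made up for the example.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw relevance scores into a probability distribution over candidates."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate passages."""
    p = softmax(teacher_scores, temperature)  # teacher distribution (soft labels)
    q = softmax(student_scores, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hard_label_loss(student_scores, positive_index):
    """Cross-entropy against the single labeled positive (assumed hard-label loss)."""
    q = softmax(student_scores)
    return -math.log(q[positive_index])


def combined_loss(teacher_scores, student_scores, positive_index, lam=0.5):
    """lam * distillation_loss + (1 - lam) * hard_label_loss."""
    distill = kl_distillation_loss(teacher_scores, student_scores)
    hard = hard_label_loss(student_scores, positive_index)
    return lam * distill + (1 - lam) * hard


# Hypothetical scores for one query with three candidate passages;
# index 0 is the labeled positive.
teacher = [4.0, 1.0, -2.0]
student = [2.0, 1.5, 0.0]
print(kl_distillation_loss(teacher, student))
print(combined_loss(teacher, student, positive_index=0, lam=0.7))
```

Setting `lam=1.0` trains on the teacher's soft labels alone, and `lam=0.0` falls back to ordinary hard-label training. In practice the same computation would run over batches in a tensor library, with the teacher scores read from the offline cache.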
Example
Training triplet with teacher:
  query: "What is photosynthesis?"
  positive: "Photosynthesis is the process by which plants..."
  negative: "Cellular respiration produces ATP from glucose..."
Teacher (cross-encoder) scores:
  score(q, positive) = 0.95
  score(q, negative) = 0.12
Student (bi-encoder) current scores:
  score(q, positive) = 0.71
  score(q, negative) = 0.45 ← student not yet distinguishing well
Margin MSE loss:
  target_margin = 0.95 - 0.12 = 0.83
  student_margin = 0.71 - 0.45 = 0.26
  loss = (0.26 - 0.83)² = (-0.57)² ≈ 0.32 ← large gradient, strong update
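The arithmetic in this example can be checked with a few lines of Python. The helper names `margin_mse` and `margin_mse_grad` are illustrative, not taken from any library; the scores are the ones listed above.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Squared difference between the student's and the teacher's margins."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return (student_margin - teacher_margin) ** 2


def margin_mse_grad(student_pos, student_neg, teacher_pos, teacher_neg):
    """Derivative of the loss with respect to the student margin."""
    return 2 * ((student_pos - student_neg) - (teacher_pos - teacher_neg))


loss = margin_mse(0.71, 0.45, 0.95, 0.12)
grad = margin_mse_grad(0.71, 0.45, 0.95, 0.12)
print(round(loss, 4))  # 0.3249
print(round(grad, 2))  # -1.14
```

The negative derivative means that widening the student's margin lowers the loss, so the update pushes the positive and negative scores further apart, toward the teacher's margin of 0.83.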
Variants and history
Knowledge distillation for IR was popularized by TAS-B (2021) and SPLADE++ (2022). ColBERTv2 uses distillation to train its late-interaction model. GPL (Generative Pseudo Labeling) applies distillation to domain adaptation: generate synthetic queries for target-domain passages, score the resulting (query, passage) pairs with a cross-encoder, and distill those scores into a bi-encoder. The teacher need not be a cross-encoder: any stronger model (a larger bi-encoder, ColBERT) can serve as the teacher.
When to use it
Use knowledge distillation when:
- A cross-encoder (or other strong model) is available as a teacher
- You want bi-encoder inference speed with cross-encoder-like quality
- Binary relevance labels are too coarse or too noisy, and graded teacher scores would give a better training signal
- Domain adaptation with GPL-style synthetic data is needed