SimCSE
What it is
SimCSE (Gao et al., 2021) is a contrastive learning framework for sentence embeddings. In its unsupervised form, it creates positive pairs by passing the same sentence through a BERT encoder twice with different dropout masks, so that the stochastic dropout acts as a minimal form of data augmentation. In its supervised form, it uses Natural Language Inference (NLI) entailment pairs as positives and contradiction pairs as hard negatives. SimCSE substantially improved the state of the art on semantic textual similarity (STS) benchmarks and influenced many subsequent embedding models.
[illustrate: Same sentence → two dropout-augmented forward passes → contrastive loss; or NLI: entailment=positive, contradiction=hard negative]
How it works
Unsupervised SimCSE:
- Encode each sentence x twice with different dropout masks to get z and z'
- Treat (z, z') as a positive pair
- Treat the encodings of all other sentences in the batch as negatives
- Loss: InfoNCE over temperature-scaled cosine similarities
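
Written as a formula, the per-sentence objective for a batch of N sentences can be sketched as follows. The symbols z_i and z'_i are the two dropout encodings of sentence i from the list above; \tau (a temperature) and \mathrm{sim} (cosine similarity) are notation introduced here for illustration, and the paper reports \tau = 0.05.

```latex
\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z'_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i, z'_j)/\tau\right)}
```

The numerator rewards agreement between the two views of the same sentence, while the denominator penalizes similarity to every other sentence in the batch.
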
Supervised SimCSE:
- Positive: the hypothesis that an NLI premise entails
- Hard negative: the hypothesis that contradicts the same premise
- Loss: the same InfoNCE cross-entropy, with each hard negative added to the in-batch negatives
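
The supervised loss can be sketched in PyTorch as below. This is a minimal illustration, not the reference implementation: the function name `supervised_simcse_loss`, the default temperature of 0.05, and the random tensors standing in for encoder outputs are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def supervised_simcse_loss(anchor, positive, hard_negative, temperature=0.05):
    """InfoNCE over in-batch positives plus one hard negative per anchor.

    anchor, positive, hard_negative: (batch, dim) embeddings of the premise,
    its entailment hypothesis, and its contradiction hypothesis.
    """
    # Pairwise cosine similarities, each of shape (batch, batch).
    pos_sim = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=-1)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), hard_negative.unsqueeze(0), dim=-1)
    # Row i scores anchor i against every positive and every hard negative.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The correct "class" for row i is column i: its own entailment hypothesis.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)


# Hypothetical stand-ins for encoder outputs (not real sentence embeddings).
torch.manual_seed(0)
premise = torch.randn(4, 8)
entailment = premise + 0.01 * torch.randn(4, 8)  # nearly identical to the anchors
contradiction = torch.randn(4, 8)                # unrelated vectors
loss = supervised_simcse_loss(premise, entailment, contradiction)
# Misaligning the positives should make the loss much larger.
shuffled = supervised_simcse_loss(premise, entailment.roll(1, dims=0), contradiction)
```

Concatenating the two similarity matrices means each anchor is contrasted against 2N - 1 candidates, so contradiction hypotheses compete directly with the true entailment.
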
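Alignment and uniformity:
- The SimCSE paper explains why the method works in terms of alignment (positive pairs lie close together) and uniformity (embeddings spread evenly over the unit hypersphere)
- Dropout augmentation improves uniformity over anisotropic pretrained BERT embeddings while keeping alignment from degrading; supervised NLI training further improves alignment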
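
Both diagnostics are straightforward to compute on L2-normalized embeddings. The sketch below follows the definitions of Wang and Isola (2020), from which the two metrics originate. The exponents `alpha=2` and `t=2` are commonly used defaults assumed here, and the synthetic `spread` and `collapsed` tensors are hypothetical stand-ins for real embeddings.

```python
import torch
import torch.nn.functional as F


def alignment(x, y, alpha=2):
    """Mean distance between positive pairs x[i], y[i] (lower is better)."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()


def uniformity(x, t=2):
    """Log of the mean Gaussian potential over all pairs (lower is better)."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()


torch.manual_seed(0)
# Roughly isotropic points on the unit sphere.
spread = F.normalize(torch.randn(256, 32), dim=1)
# Points squeezed into a narrow cone, mimicking an anisotropic embedding space.
collapsed = F.normalize(torch.randn(256, 32) * 0.1 + 5.0, dim=1)

print("uniformity (spread):   ", uniformity(spread).item())
print("uniformity (collapsed):", uniformity(collapsed).item())
```

A collapsed space scores close to zero on uniformity, while a well-spread one is strongly negative, which is the direction contrastive training is meant to push.
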
Example
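# Unsupervised: same batch of sentences, two forward passes with dropout active
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")
model.train()  # dropout is only active in train mode; do not wrap in torch.no_grad()
sentences = ["The cat sat on the mat.", "A dog barked at the mail carrier."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
emb1 = model(**inputs).last_hidden_state[:, 0]  # [CLS] vectors, first dropout mask
emb2 = model(**inputs).last_hidden_state[:, 0]  # same input, second dropout mask
# emb1[i] and emb2[i] are a positive pair; every emb2[j] with j != i is a negative
sim = F.cosine_similarity(emb1.unsqueeze(1), emb2.unsqueeze(0), dim=-1)
loss = F.cross_entropy(sim / 0.05, torch.arange(len(sentences)))  # InfoNCE, temperature 0.05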
Variants and history
SimCSE (2021) demonstrated that minimal augmentation (dropout noise) is sufficient for strong unsupervised embeddings, and that supervised NLI data provides a significant further boost. Its recipe of contrastive training with in-batch and hard negatives carried over into later models such as E5 and BGE and the broader wave of instruction-tuned embedding models. The alignment/uniformity analysis became a widely used diagnostic for embedding quality.
When to use it
Use SimCSE-style training when:
- You need strong sentence embeddings for semantic similarity or retrieval
- No domain-specific labeled data is available (use the unsupervised variant)
- NLI data is available (use the supervised variant for higher quality)
- You want a well-understood, reproducible baseline
For retrieval specifically, DPR-style training on domain-specific query-passage labels typically outperforms SimCSE when such data exists.