Contrastive Loss

What it is

Contrastive loss is a family of training objectives that shape an embedding space by pulling similar (positive) pairs together and pushing dissimilar (negative) pairs apart. The InfoNCE variant, named after noise-contrastive estimation (NCE), dominates modern dense retrieval and sentence embedding training. It frames learning as a classification problem: given a query, identify the correct positive among a set of candidates that also contains negatives.

[illustrate: Query embedding pulled toward positive, pushed from negatives; InfoNCE loss landscape; temperature scaling effect]

How it works

InfoNCE loss

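For a batch of N (query, positive) pairs, the loss for a single query q is:

score(q, d) = q · d / τ           # dot product, temperature-scaled

L(q) = -log( exp(score(q, d+)) / Σ_j exp(score(q, d_j)) )

The batch loss is the mean of L(q) over the N queries. When q and d are L2-normalized, the dot product equals cosine similarity and every unscaled score lies in [-1, 1].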

Where:

  • d+ is the positive document for query q
  • d_j ranges over d+ and all negatives in the batch
  • τ is a temperature hyperparameter (lower = sharper distribution)
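
To make the formula concrete, here is a minimal pure-Python sketch that evaluates the loss for one query. The function name infonce_single, the helper dot, and the toy 2-d embeddings are illustrative assumptions, not part of any library.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def infonce_single(query, docs, pos_index, tau=0.05):
    """InfoNCE loss for one query; docs holds the positive and the negatives."""
    scores = [dot(query, d) / tau for d in docs]
    # Log-sum-exp with the max subtracted for numerical stability
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    # -log(exp(s+) / sum_j exp(s_j)) = logsumexp(s) - s+
    return log_denominator - scores[pos_index]

# Hypothetical unit-length embeddings
q = [1.0, 0.0]
docs = [
    [0.8, 0.6],    # positive: close to q
    [0.0, 1.0],    # negative: orthogonal to q
    [-1.0, 0.0],   # negative: opposite to q
]

loss_correct = infonce_single(q, docs, pos_index=0)  # positive already ranks first
loss_wrong = infonce_single(q, docs, pos_index=1)    # pretend the orthogonal doc is the positive
```

The loss is close to zero when the positive clearly outscores every negative, and it grows as the positive falls behind other candidates.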

Temperature scaling

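  • Low temperature (τ = 0.01–0.05): sharp distribution, gradient concentrated on the hardest negatives, training can become unstable
  • High temperature (τ = 0.1–0.5): soft distribution, more stable but typically slower convergence
  • Typical value: τ = 0.05 with cosine-normalized embeddings (the value used by SimCSE); treat it as a starting point and tune it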
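
The effect of τ can be seen by applying a temperature-scaled softmax to a fixed set of similarities. This is a small sketch; the function name and the similarity values are made up for illustration.

```python
import math

def softmax_with_temperature(similarities, tau):
    """Softmax over similarities divided by the temperature tau."""
    scaled = [s / tau for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities: positive, hard negative, easy negative
sims = [0.70, 0.60, 0.10]

sharp = softmax_with_temperature(sims, tau=0.05)
soft = softmax_with_temperature(sims, tau=0.5)

# Share of the negative probability mass carried by the hard negative
hard_share_sharp = sharp[1] / (sharp[1] + sharp[2])
hard_share_soft = soft[1] / (soft[1] + soft[2])
```

In the cross-entropy gradient, each negative is weighted by its softmax probability. At τ = 0.05 the hard negative carries almost all of the negative mass, while at τ = 0.5 the easy negative still contributes a noticeable share.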

What makes a good negative

  • Random negatives: easy to distinguish, weak training signal
  • In-batch negatives: come for free from batch construction, moderate difficulty
  • BM25 hard negatives: lexically similar but not relevant, harder
  • Mined hard negatives (ANCE): passages the current model scores highly but that are not relevant; the hardest and usually the most informative, though unlabeled relevant passages can slip in as false negatives
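
The idea behind model-mined hard negatives can be sketched in a few lines: score every passage with the current model and keep the top-scoring ones that are not labeled positive. This is a simplified illustration in the spirit of ANCE, not its actual implementation, which retrieves candidates from an approximate nearest neighbor index that is refreshed during training. The function name, passage ids, and embeddings are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mine_hard_negatives(query_emb, passage_embs, positive_ids, k=2):
    """Return ids of the k highest-scoring passages that are not labeled positive."""
    scored = [
        (dot(query_emb, emb), pid)
        for pid, emb in passage_embs.items()
        if pid not in positive_ids  # never use a known positive as a negative
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]

# Hypothetical toy corpus: embeddings produced by the current model
passages = {
    "p1": [0.9, 0.1],   # labeled positive for the query
    "p2": [0.8, 0.3],
    "p3": [0.1, 0.9],
    "p4": [0.7, 0.2],
}
query = [1.0, 0.0]

hard = mine_hard_negatives(query, passages, positive_ids={"p1"}, k=2)
```

Because the filter only knows about labeled positives, any unlabeled relevant passage would be returned as a negative, which is the false-negative risk that comes with aggressive mining.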

Example

import torch
import torch.nn.functional as F

def infonce_loss(query_embs, pos_embs, temperature=0.05):
    # query_embs: [B, d], pos_embs: [B, d]; row i of each tensor forms a positive pair
    # In-batch negatives: the positives of the other queries serve as negatives

    # Normalize
    q = F.normalize(query_embs, dim=-1)
    p = F.normalize(pos_embs, dim=-1)

    # Similarity matrix: [B, B]
    sim = torch.matmul(q, p.T) / temperature

    # Labels: diagonal is positive pair
    labels = torch.arange(sim.size(0), device=sim.device)

    return F.cross_entropy(sim, labels)

Variants and history

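Contrastive loss dates back to Siamese-network work on similarity learning (Chopra et al., 2005; Hadsell et al., 2006), which used a margin-based pairwise loss. InfoNCE was introduced with Contrastive Predictive Coding (van den Oord et al., 2018) and popularized in computer vision by MoCo and SimCLR (2019–2020). In NLP it is used in DPR (2020) and SimCSE (2021). NT-Xent, the SimCLR loss, has the same form as InfoNCE with cosine similarity and a temperature. Supervised retrieval setups such as DPR add explicit hard negatives alongside in-batch negatives. SupCon (Khosla et al., 2020) extends the loss to multiple positives per anchor.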

When to use it

InfoNCE is the default choice for:

  • Dense retrieval model training (DPR, Contriever)
  • Sentence embedding training (SimCSE, E5)
  • Any task where you have (anchor, positive) pairs and want embedding alignment

Key decisions: temperature, negative strategy (in-batch vs. mined hard negatives), and batch size (a larger batch supplies more in-batch negatives, which generally strengthens the training signal).
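
One practical consequence of the batch-size choice is that raw loss values are not comparable across batch sizes. With in-batch negatives, a model that scores all candidates equally gets a loss of ln(B), so the chance-level baseline rises with B. The sketch below checks this numerically; the function name is an illustrative assumption.

```python
import math

def uniform_infonce(batch_size):
    """InfoNCE for one query when all batch_size candidates get the same score."""
    scores = [0.0] * batch_size
    log_denominator = math.log(sum(math.exp(s) for s in scores))
    return log_denominator - scores[0]

for batch_size in (32, 256, 4096):
    negatives = batch_size - 1  # every other positive in the batch acts as a negative
    print(batch_size, negatives, round(uniform_infonce(batch_size), 3))
```

Comparing the training loss against ln(B) is a quick sanity check that the model has moved away from chance.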

See also