In-Batch Negatives

What it is

In-batch negatives is a training technique for contrastive learning in which, for each (query, positive_passage) pair in a mini-batch, the positive passages of all other pairs in the batch are treated as negatives. This gives B-1 negatives per query at almost no extra cost: the passage embeddings are already computed, so the only additional work is a B×B similarity matrix, which feeds the InfoNCE loss. Larger batches supply more, and more diverse, negatives and so raise the chance that some of them are hard, which is why dense retrieval models are usually trained with large batches (hundreds to thousands of pairs).

[illustrate: B×B similarity matrix; diagonal = positives; off-diagonal = in-batch negatives; cross-entropy per row]

How it works

Batch of 4 pairs:
  (q1, p1+), (q2, p2+), (q3, p3+), (q4, p4+)

Similarity matrix:
       p1+   p2+   p3+   p4+
  q1 [ 0.9   0.2   0.1   0.3 ]  → q1's positive is p1+; negatives are p2+, p3+, p4+
  q2 [ 0.1   0.8   0.3   0.2 ]  → q2's positive is p2+; negatives are p1+, p3+, p4+
  q3 [ 0.2   0.3   0.9   0.1 ]
  q4 [ 0.3   0.1   0.2   0.7 ]

Loss = cross_entropy(sim_matrix, [0, 1, 2, 3])
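
The loss line above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name in_batch_negatives_loss and the temperature parameter are assumptions introduced here, and the matrix is the 4×4 example from this section.

```python
import numpy as np

def in_batch_negatives_loss(sim, temperature=1.0):
    """Mean cross-entropy over the rows of a B x B similarity matrix.

    Row i holds the scores of query i against every passage in the batch;
    the positive for query i is assumed to sit in column i (the diagonal).
    """
    logits = np.asarray(sim, dtype=np.float64) / temperature
    # Subtract the row maximum before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The target of row i is column i, so the loss reads off the diagonal.
    return float(-np.mean(np.diag(log_probs)))

sim_matrix = np.array([
    [0.9, 0.2, 0.1, 0.3],
    [0.1, 0.8, 0.3, 0.2],
    [0.2, 0.3, 0.9, 0.1],
    [0.3, 0.1, 0.2, 0.7],
])

loss = in_batch_negatives_loss(sim_matrix)
# A small temperature sharpens the softmax (hypothetical value, for illustration).
sharp_loss = in_batch_negatives_loss(sim_matrix, temperature=0.05)
# An uninformative matrix gives the uniform-guess loss log(B).
uniform_loss = in_batch_negatives_loss(np.zeros((4, 4)))
```

Because every diagonal entry is the largest in its row, loss falls below the uniform-guess value log(4), and dividing by a small temperature pushes it further toward zero.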

Limitations

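  1. False negatives: another passage in the batch may actually be relevant to a query. The risk grows with batch size and with topical overlap between pairs.
  2. Easy negatives: in-batch passages are effectively random samples, and random passages are usually easy to tell apart from the positive. Small batches make this worse because few of the B-1 negatives are informative.
  3. GPU memory constraint: the whole batch must pass through both encoders in a single step. Encoder activations, which grow with batch size × sequence length, usually dominate memory; the embeddings themselves (batch size × embedding dimension × 2 for query + passage) are small by comparison.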
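
One common mitigation for the first limitation is to mask out in-batch passages that are already labelled as relevant to a query. The sketch below assumes each passage has a corpus id and each query has a set of labelled positive ids; the function name mask_known_false_negatives and the ids d1 and d2 are hypothetical. It cannot catch relevant passages that were never labelled.

```python
import numpy as np

def mask_known_false_negatives(sim, passage_ids, positive_ids):
    """Set the scores of off-diagonal known positives to -inf.

    passage_ids[j]  : corpus id of the passage in column j.
    positive_ids[i] : set of corpus ids labelled as relevant to query i.
    """
    masked = np.array(sim, dtype=np.float64)
    for i, positives in enumerate(positive_ids):
        for j, pid in enumerate(passage_ids):
            if j != i and pid in positives:
                # exp(-inf) = 0, so this column drops out of row i's softmax.
                masked[i, j] = -np.inf
    return masked

# Queries 0 and 2 share the same positive passage d1, so each would
# otherwise be trained to push its own positive away.
sim = np.array([
    [0.9, 0.2, 0.9],
    [0.1, 0.8, 0.1],
    [0.9, 0.3, 0.9],
])
passage_ids = ["d1", "d2", "d1"]
positive_ids = [{"d1"}, {"d2"}, {"d1"}]

masked = mask_known_false_negatives(sim, passage_ids, positive_ids)
```

The diagonal is never masked, so every row keeps its own positive and the cross-entropy over the masked matrix stays well defined.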

Interaction with batch size

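In-batch negatives scale with batch size: a batch of 512 gives 511 negatives per query. Batch size is often reported to matter as much as architecture choices. Plain gradient accumulation does not help here, because each micro-batch computes its own loss and sees only its own negatives; growing the negative pool requires sharing embeddings across GPUs or a gradient-caching scheme. DPR used a batch size of 128; many later models train with batch sizes in the thousands. FAISS and similar ANN indexes are used to mine hard negatives from the corpus, not to form in-batch negatives.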
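
When the batch is split across devices, the negative pool only grows if passage embeddings are shared between them. The sketch below simulates that sharing on one machine: a Python list of arrays stands in for the result of an all-gather, and the function name gathered_logits_and_labels is an assumption made for illustration. It shows the one detail that is easy to get wrong, which is that the target column for each query must be shifted by the device's offset.

```python
import numpy as np

def gathered_logits_and_labels(local_queries, passages_per_device, rank):
    """Score one device's queries against passages gathered from all devices.

    local_queries       : (b, d) query embeddings held by this device.
    passages_per_device : list of (b, d) arrays in rank order; a stand-in
                          for the output of an all-gather collective.
    rank                : index of this device in that list.
    """
    all_passages = np.concatenate(passages_per_device, axis=0)  # (b * n, d)
    logits = local_queries @ all_passages.T                     # (b, b * n)
    b = local_queries.shape[0]
    # Query i's positive is passage i of the local shard, so shift by rank * b.
    labels = np.arange(b) + rank * b
    return logits, labels

rng = np.random.default_rng(0)
num_devices, b, d = 4, 8, 16
queries_per_device = [rng.normal(size=(b, d)) for _ in range(num_devices)]
passages_per_device = [rng.normal(size=(b, d)) for _ in range(num_devices)]

rank = 2
logits, labels = gathered_logits_and_labels(
    queries_per_device[rank], passages_per_device, rank
)
# Each query now sees 4 * 8 - 1 negatives instead of 8 - 1.
negatives_per_query = logits.shape[1] - 1
```

In a real distributed setup the gather step may also need care so that gradients reach the passage encoder on every device; that part is outside this sketch.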

Variants and history

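In-batch negatives became standard for dense retrieval with DPR (2020). Cross-batch negatives (RocketQA) share passage embeddings across GPUs, so each query is scored against the passages of every device. Queue-based variants keep a queue of embeddings from earlier batches, optionally encoded by a momentum encoder as in MoCo, which decouples the negative count from the batch size. Denoised hard negatives (RocketQA) use a cross-encoder to filter likely false negatives out of the mined hard negatives before training.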
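
The queue idea can be sketched as a fixed-size FIFO buffer whose contents are appended as extra negative columns. This is an illustration only: the class name NegativeQueue is hypothetical, the momentum encoder is omitted, and random vectors stand in for encoder outputs.

```python
import numpy as np

class NegativeQueue:
    """Fixed-size FIFO buffer of passage embeddings from earlier batches."""

    def __init__(self, capacity, dim):
        self.buffer = np.zeros((capacity, dim))
        self.size = 0    # number of valid rows currently stored
        self.cursor = 0  # next row to overwrite (oldest once the buffer is full)

    def enqueue(self, embeddings):
        for row in embeddings:
            self.buffer[self.cursor] = row
            self.cursor = (self.cursor + 1) % len(self.buffer)
            self.size = min(self.size + 1, len(self.buffer))

    def negatives(self):
        return self.buffer[: self.size]

rng = np.random.default_rng(0)
queue = NegativeQueue(capacity=10, dim=8)
for _ in range(3):                       # three earlier batches of 4 passages
    queue.enqueue(rng.normal(size=(4, 8)))

queries = rng.normal(size=(4, 8))
passages = rng.normal(size=(4, 8))
# In-batch scores first, queued negatives appended as extra columns.
logits = np.concatenate(
    [queries @ passages.T, queries @ queue.negatives().T], axis=1
)
labels = np.arange(len(queries))         # positives stay on the in-batch diagonal
```

Here each query is scored against 3 in-batch negatives plus 10 queued ones, so the negative count is set by the queue capacity rather than by the batch size.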

When to use it

In-batch negatives are the usual starting point for dense retrieval training, and there is rarely a reason to leave them out. The question is whether to augment them with hard negatives:

  • In-batch only: fast, good baseline, limited by batch size
  • In-batch + BM25 hard negatives: standard DPR setup
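  • In-batch + dynamic hard negatives (ANCE): negatives are re-mined from an ANN index that is periodically refreshed with the current model; usually the strongest option and the most expensive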
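
Combining in-batch and hard negatives usually means widening the similarity matrix rather than changing the loss. The sketch below assumes k mined hard negatives per query and a layout in which every query is scored against all positives and all hard negatives in the batch; the function name logits_with_hard_negatives is hypothetical, and random vectors stand in for encoder outputs.

```python
import numpy as np

def logits_with_hard_negatives(queries, positives, hard_negatives):
    """Build logits for in-batch plus mined hard negatives.

    queries, positives : (B, d) embeddings, row i of each forming a pair.
    hard_negatives     : (B, k, d) embeddings, k mined negatives per query.
    """
    b, k, d = hard_negatives.shape
    # One query's hard negatives also serve as negatives for every other query.
    candidates = np.concatenate(
        [positives, hard_negatives.reshape(b * k, d)], axis=0
    )                                     # (B + B * k, d)
    logits = queries @ candidates.T       # (B, B + B * k)
    labels = np.arange(b)                 # positives occupy the first B columns
    return logits, labels

rng = np.random.default_rng(0)
B, k, d = 4, 1, 8
queries = rng.normal(size=(B, d))
positives = rng.normal(size=(B, d))
hard_negatives = rng.normal(size=(B, k, d))

logits, labels = logits_with_hard_negatives(queries, positives, hard_negatives)
```

With B = 4 and k = 1 each query is compared against 8 candidates: its positive, 3 in-batch negatives, and 4 hard negatives. The same cross-entropy over rows applies unchanged.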

See also