Listwise Ranking Loss
What it is
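Listwise ranking losses optimize over the entire list of candidates for a query, rather than scoring each document independently (pointwise) or comparing pairs (pairwise). They either optimize a differentiable approximation of a ranking metric such as NDCG, or a surrogate that bounds it more tightly than pointwise or pairwise objectives do. The trade-off: they cost more to compute than pointwise losses and need all candidates for a query in the same batch, but they train on the ordering itself and often rank better as a result. Their scores are meaningful only relative to the other candidates in the list; they are not calibrated relevance probabilities.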
[illustrate: Pointwise: score each doc independently; pairwise: compare (doc_i, doc_j); listwise: optimize over full ranked list jointly]
The three approaches
Pointwise
- Score each (query, document) pair independently
- Loss: cross-entropy(score, binary_label)
- Examples: MonoBERT, MonoT5
- Ignores relative ordering between documents
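As a rough sketch of the pointwise setup described above, assuming PyTorch and raw logits as scores (the name `pointwise_bce_loss` is illustrative and not taken from MonoBERT or MonoT5):

```python
import torch
import torch.nn.functional as F

def pointwise_bce_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy applied to each (query, document) logit on its own."""
    # scores: [k] raw logits, one per candidate document
    # labels: [k] binary relevance (1 = relevant, 0 = not relevant)
    return F.binary_cross_entropy_with_logits(scores, labels.float())

# Hypothetical scores for three candidates of one query
scores = torch.tensor([2.0, -1.0, 0.5])
labels = torch.tensor([1, 0, 0])
loss = pointwise_bce_loss(scores, labels)
```

Each term depends only on its own document's score, so changing one candidate's score never changes the loss contribution of another. That is the sense in which the pointwise loss ignores relative ordering.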
Pairwise
- Compare (positive, negative) pairs
- Loss: logistic loss (as in RankNet) or hinge/margin loss on score(q, d+) - score(q, d-)
- Examples: RankNet, LambdaRank
- Ignores global list structure
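The pairwise idea can be sketched as follows, assuming PyTorch and a single query with k candidates. The function name `pairwise_losses` and the default `margin=1.0` are assumptions made for illustration; the logistic variant is the RankNet-style form.

```python
import torch
import torch.nn.functional as F

def pairwise_losses(scores, labels, margin=1.0):
    """Return (hinge, logistic) losses averaged over all (d+, d-) pairs."""
    # diff[i, j] = score_i - score_j
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)
    # Keep only pairs where doc i is labeled more relevant than doc j
    pair_mask = labels.unsqueeze(1) > labels.unsqueeze(0)
    margins = diff[pair_mask]
    if margins.numel() == 0:
        zero = scores.sum() * 0.0  # no usable pairs; keep the autograd graph intact
        return zero, zero
    # Hinge: penalize pairs whose score gap is smaller than the margin
    hinge = torch.clamp(margin - margins, min=0).mean()
    # Logistic (RankNet-style): log(1 + exp(-(s+ - s-)))
    logistic = F.softplus(-margins).mean()
    return hinge, logistic
```

Every pair contributes with the same weight regardless of where the two documents sit in the ranking, which is why a purely pairwise loss is blind to global list structure.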
Listwise
- Score all candidates jointly
- Loss: approximates or bounds a ranking metric computed over the whole list
- Examples: ListNet, LambdaLoss, ApproxNDCG, SoftmaxCE
Key listwise losses
SoftmaxCE / InfoNCE: treat the positive as the class to predict among all candidates. Used in contrastive retrieval training.
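A minimal sketch of both forms, assuming PyTorch. The function names, the in-batch-negatives setup, and the `temperature=0.05` default are illustrative assumptions, not values prescribed by any particular system.

```python
import torch
import torch.nn.functional as F

def softmax_ce_loss(scores, positive_index):
    """Softmax cross-entropy over a candidate list with one positive per query."""
    # scores: [batch, k] model scores; positive_index: [batch] index of the positive
    return F.cross_entropy(scores, positive_index)

def info_nce_in_batch(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: doc i is the positive for query i."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature  # [batch, batch] cosine similarities, scaled
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```

With uniform scores over k candidates the loss equals log(k), which is a convenient sanity check at the start of training.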
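ListNet: convert scores and labels to probability distributions via softmax, then minimize the cross-entropy between them. This is equivalent to minimizing the KL divergence, since the two differ only by the entropy of the label distribution, which does not depend on the model.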
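The top-one form of ListNet can be sketched like this, assuming PyTorch and batched inputs of shape [batch, k]. The name `listnet_losses` is hypothetical, and both the cross-entropy and the KL value are returned only to show how they relate.

```python
import torch
import torch.nn.functional as F

def listnet_losses(scores, labels):
    """Return (cross_entropy, kl) between softmax(labels) and softmax(scores)."""
    # Target distribution over the list, built from graded relevance labels
    target = F.softmax(labels.float(), dim=-1)
    # Predicted log-distribution over the list, built from model scores
    log_pred = F.log_softmax(scores, dim=-1)
    cross_entropy = -(target * log_pred).sum(dim=-1).mean()
    # F.kl_div expects log-probabilities as input and probabilities as target
    kl = F.kl_div(log_pred, target, reduction="batchmean")
    return cross_entropy, kl
```

The two values differ by the entropy of the target distribution, which is constant with respect to the model, so their gradients are identical.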
ApproxNDCG (Qin et al., 2010): differentiable approximation to NDCG using a sigmoid approximation to the indicator function.
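LambdaLoss (Wang et al., 2018): weights each document pair by how much NDCG would change if the two documents swapped positions. It gives the LambdaRank heuristic an explicit loss function that bounds NDCG, which makes it one of the better-grounded approaches; used in commercial search ranking.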
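To make the swap weighting concrete, here is a sketch of a LambdaRank-style pairwise logistic loss weighted by |ΔNDCG|, assuming PyTorch and a single query. This is the simplest member of the family that the LambdaLoss paper analyzes, not the full framework; the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def lambda_weighted_pairwise_loss(scores, labels):
    """Pairwise logistic loss, each pair weighted by |delta NDCG| of swapping it."""
    k = scores.numel()
    dev = scores.device
    positions = torch.arange(1, k + 1, dtype=torch.float32, device=dev)
    # Current 1-based rank of each document under the model scores
    order = torch.argsort(scores, descending=True)
    ranks = torch.empty(k, dtype=torch.float32, device=dev)
    ranks[order] = positions
    gains = 2.0 ** labels.float() - 1.0
    discounts = 1.0 / torch.log2(ranks + 1.0)
    # Ideal DCG: gains sorted by descending relevance
    ideal = (torch.sort(gains, descending=True).values
             / torch.log2(positions + 1.0)).sum().clamp(min=1e-8)
    # |delta NDCG| if documents i and j swapped positions (constant w.r.t. scores)
    delta = ((gains.unsqueeze(1) - gains.unsqueeze(0)).abs()
             * (discounts.unsqueeze(1) - discounts.unsqueeze(0)).abs() / ideal)
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)  # diff[i, j] = s_i - s_j
    mask = labels.unsqueeze(1) > labels.unsqueeze(0)  # i more relevant than j
    return (delta * F.softplus(-diff))[mask].sum()
```

Pairs whose swap would move a highly relevant document near the top of the list get the largest weights, so the gradient concentrates on the mistakes that cost the most NDCG.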
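# ApproxNDCG intuition (sketch; sigma is a temperature, smaller = sharper ranks)
import torch

def ideal_dcg(labels):
    # DCG of the best possible ordering: labels sorted by descending relevance
    sorted_labels, _ = torch.sort(labels.float(), descending=True)
    positions = torch.arange(1, labels.numel() + 1, dtype=torch.float32, device=labels.device)
    return ((2 ** sorted_labels - 1) / torch.log2(positions + 1)).sum()

def approx_ndcg_loss(scores, labels, sigma=1.0):
    # scores: [k] model scores for k candidates
    # labels: [k] graded relevance labels
    # Approximate rank using sigmoid: how many docs score above doc i?
    # diff[i, j] = s_j - s_i; the diagonal adds sigmoid(0) = 0.5, so start from 0.5
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)
    approx_ranks = 0.5 + torch.sigmoid(diff / sigma).sum(dim=1)
    # Compute NDCG with approximate ranks, negate for gradient descent
    gain = (2 ** labels.float() - 1) / torch.log2(approx_ranks + 1)
    return -gain.sum() / ideal_dcg(labels).clamp(min=1e-8)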
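For comparison with the smooth approximation above, the exact metric can be sketched in plain Python. It assumes the same gain 2^label - 1 and log2 discount; the helper names `dcg` and `ndcg` are illustrative.

```python
import math

def dcg(labels_in_rank_order):
    """DCG of a list of graded labels already sorted by predicted rank."""
    return sum(
        (2 ** label - 1) / math.log2(position + 1)
        for position, label in enumerate(labels_in_rank_order, start=1)
    )

def ndcg(scores, labels):
    """Exact NDCG: sort by score, compute DCG, divide by the ideal DCG."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0
```

The sort makes this function piecewise constant in the scores, so its gradient is zero almost everywhere. That is the reason a sigmoid-based rank approximation is needed for training.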
Variants and history
Pairwise and listwise losses were core to the learning-to-rank literature (2005–2010: RankNet, LambdaRank, LambdaMART). In the transformer era they were carried over to neural reranking: RankT5 fine-tunes T5 with a listwise softmax loss, and LambdaLoss is used in Google’s production ranking. The InfoNCE loss used in contrastive retrieval is technically listwise, since it optimizes over one positive and a list of negatives.
When to use it
- Pointwise (MonoBERT style): simplest, works well with enough data
- Pairwise (RankNet style): good when comparative judgments are available
- Listwise (LambdaLoss, ApproxNDCG): use when NDCG is the evaluation metric and you have graded relevance labels; requires all candidates in memory simultaneously