Listwise Ranking Loss

What it is

Listwise ranking losses optimize over the entire list of candidates for a query, rather than scoring each document independently (pointwise) or comparing pairs (pairwise). They either optimize a smooth approximation of a ranking metric such as NDCG or a bound on it, so the training objective tracks final ranking quality more closely. The trade-off: they are more computationally expensive than pointwise losses and need all candidates for a query at once, but they tend to produce better-ordered rankings. The resulting scores are meaningful only relative to other candidates in the same list; they are not calibrated relevance probabilities.

[illustrate: Pointwise: score each doc independently; pairwise: compare (doc_i, doc_j); listwise: optimize over full ranked list jointly]

The three approaches

Pointwise

Score each (query, document) pair independently
Loss: cross-entropy(score, binary_label)
Examples: MonoBERT, MonoT5
Ignores relative ordering between documents
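
As a rough illustration of the pointwise setup above, the sketch below computes binary cross-entropy on raw logits in plain Python. The function name and the example scores are made up for illustration; a real MonoBERT-style trainer would do the same arithmetic on tensors with autograd.

```python
import math

def pointwise_bce_loss(scores, labels):
    """Mean binary cross-entropy over independently scored documents.

    scores: raw model logits, one per (query, document) pair
    labels: 1 for relevant, 0 for non-relevant
    """
    total = 0.0
    for s, y in zip(scores, labels):
        # log(1 + exp(-s)) is -log(sigmoid(s)); log(1 + exp(s)) is -log(1 - sigmoid(s))
        total += y * math.log1p(math.exp(-s)) + (1 - y) * math.log1p(math.exp(s))
    return total / len(scores)

# Hypothetical logits for three candidates of one query; only the first is relevant.
loss = pointwise_bce_loss([2.0, -1.0, 0.5], [1, 0, 0])
```

Each term depends on a single document, which is why the loss cannot see whether a relevant document actually outranks a non-relevant one.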

Pairwise

Compare (positive, negative) pairs
Loss: hinge (margin) loss or logistic loss on score(q, d+) - score(q, d-); RankNet uses the logistic form
Examples: RankNet, LambdaRank
Ignores global list structure
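
The two common pairwise forms can be sketched as follows. This is a minimal plain-Python version with hypothetical function names and an assumed default margin of 1.0, not code from any particular library.

```python
import math

def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    # Zero once the positive outscores the negative by at least `margin`
    return max(0.0, margin - (pos_score - neg_score))

def pairwise_logistic_loss(pos_score, neg_score):
    # RankNet-style: -log(sigmoid(pos_score - neg_score))
    return math.log1p(math.exp(-(pos_score - neg_score)))

# Only the score difference matters, never the absolute scores.
hinge = pairwise_hinge_loss(0.5, 0.0)        # 0.5: margin not yet satisfied
logistic = pairwise_logistic_loss(0.5, 0.0)
```

Both losses look at one pair at a time, so a mistake at rank 1 costs the same as a mistake at rank 50 unless the pairs are reweighted.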

Listwise

Score all candidates for a query, then compute one loss over the whole list
Loss: approximates or bounds a ranking metric, or matches a distribution over the list
Examples: ListNet, LambdaLoss, ApproxNDCG, SoftmaxCE

Key listwise losses

SoftmaxCE / InfoNCE: treat the positive as the class to predict among all candidates. Used in contrastive retrieval training.
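
A minimal sketch of this loss for a single positive, in plain Python. The function name is illustrative, and temperature scaling (common in InfoNCE) is left out on purpose.

```python
import math

def softmax_ce_loss(scores, positive_index):
    # Negative log-probability of the positive under a softmax over all candidates
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - scores[positive_index]

# One positive (index 0) among two negatives; all scores are hypothetical.
loss = softmax_ce_loss([2.0, 0.5, -1.0], positive_index=0)
```

Because the normalizer sums over every candidate, raising any negative's score increases the loss, which is what makes the objective listwise.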

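ListNet: convert scores and labels to probability distributions via softmax, then minimize the cross-entropy between them. This is equivalent to minimizing KL divergence, since the two differ only by the entropy of the label distribution, which does not depend on the model.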
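
The ListNet idea can be sketched in a few lines of plain Python. The version below computes the KL-divergence form over top-one distributions; the helper names are hypothetical, and the cross-entropy form differs from it only by a constant that does not depend on the scores.

```python
import math

def softmax(values):
    max_value = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - max_value) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_kl_loss(scores, labels):
    # KL(target || predicted) between softmax(labels) and softmax(scores)
    target = softmax(labels)
    predicted = softmax(scores)
    return sum(t * (math.log(t) - math.log(p)) for t, p in zip(target, predicted))

# Graded labels for three candidates and two hypothetical score vectors.
good = listnet_kl_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])  # scores match labels
bad = listnet_kl_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0])   # scores reversed
```

The loss reaches zero when the score distribution matches the label distribution and grows as the two diverge.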

ApproxNDCG (Qin et al., 2010): differentiable approximation to NDCG using a sigmoid approximation to the indicator function.

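LambdaLoss (Wang et al., 2018): a probabilistic framework for metric-driven losses. It weights each document pair by the change in NDCG that swapping the two documents would cause, as LambdaRank does, and shows that this weighting optimizes a bound on NDCG. This gives the lambda family of losses a theoretical footing; used in commercial search ranking.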
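
To make the swap-based weighting concrete, here is a sketch of a pairwise logistic loss weighted by the NDCG change of each swap. It follows the spirit of LambdaRank weighting and is not the exact objective from Wang et al.; all function names are hypothetical.

```python
import math

def dcg(labels_in_rank_order):
    # Position 0 is the top of the list, so its discount is log2(2) = 1
    return sum((2**rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(labels_in_rank_order))

def swap_delta_ndcg(labels_in_rank_order, i, j):
    # Absolute change in NDCG if the documents at positions i and j trade places
    ideal = dcg(sorted(labels_in_rank_order, reverse=True))
    if ideal == 0:
        return 0.0
    swapped = list(labels_in_rank_order)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(labels_in_rank_order)) / ideal

def lambda_weighted_loss(scores, labels):
    # Rank documents by current score, then weight each mis-orderable pair
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    ranked_labels = [labels[k] for k in order]
    ranked_scores = [scores[k] for k in order]
    loss = 0.0
    for i in range(len(order)):
        for j in range(len(order)):
            if ranked_labels[i] > ranked_labels[j]:
                weight = swap_delta_ndcg(ranked_labels, i, j)
                diff = ranked_scores[i] - ranked_scores[j]
                loss += weight * math.log1p(math.exp(-diff))
    return loss
```

Pairs near the top of the list get larger weights because swapping them moves NDCG more, which is how a pairwise loss becomes sensitive to list position.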

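# ApproxNDCG intuition
import torch

def ideal_dcg(labels):
    # DCG of the best possible ordering: labels sorted by descending relevance
    sorted_labels, _ = torch.sort(labels, descending=True)
    positions = torch.arange(1, labels.numel() + 1, dtype=torch.float32)
    return ((2**sorted_labels - 1) / torch.log2(positions + 1)).sum()

def approx_ndcg_loss(scores, labels, sigma=1.0):
    # scores: [k] model scores for k candidates
    # labels: [k] graded relevance labels, at least one of them non-zero
    # sigma: temperature; smaller values bring the ranks closer to the exact ones

    # Approximate rank using sigmoid: how many docs score above doc i?
    # diff[i, j] = scores[j] - scores[i]; the j == i term adds sigmoid(0) = 0.5,
    # so 0.5 + sum over all j equals 1 + sum over j != i
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)
    approx_ranks = 0.5 + torch.sigmoid(diff / sigma).sum(dim=1)
    # Compute NDCG with approximate ranks, negate for gradient descent
    gain = (2**labels - 1) / torch.log2(approx_ranks + 1)
    return -gain.sum() / ideal_dcg(labels)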
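
To see what the sigmoid approximation buys, the plain-Python sketch below compares exact ranks with smoothed ranks at two temperatures. The helper names and scores are hypothetical; the point is only that the smooth ranks approach the exact ones as sigma shrinks, at the cost of sharper gradients.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    e = math.exp(x)
    return e / (1 + e)

def exact_ranks(scores):
    # 1-based rank: 1 + number of documents with a strictly higher score
    return [1 + sum(1 for other in scores if other > s) for s in scores]

def sigmoid_ranks(scores, sigma):
    # Smooth rank: each higher-scoring document contributes close to 1
    ranks = []
    for i, s in enumerate(scores):
        above = sum(sigmoid((other - s) / sigma)
                    for j, other in enumerate(scores) if j != i)
        ranks.append(1 + above)
    return ranks

scores = [2.0, 0.5, 1.0]
hard = exact_ranks(scores)
smooth = sigmoid_ranks(scores, sigma=1.0)   # noticeably blurred ranks
sharp = sigmoid_ranks(scores, sigma=0.01)   # close to the exact ranks
```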

Variants and history

Pairwise and listwise losses were core to the learning-to-rank literature (2005–2010: RankNet, LambdaRank, LambdaMART). They carried over to neural reranking in the transformer era: RankT5 fine-tunes T5 with ranking losses, including a listwise softmax cross-entropy, and LambdaLoss is reportedly used in Google’s production ranking. The InfoNCE loss used in contrastive retrieval is technically listwise, since it optimizes over a list made up of one positive and many negatives.

When to use it

Pointwise (MonoBERT style): simplest, works well with enough data
Pairwise (RankNet style): good when comparative judgments are available
Listwise (LambdaLoss, ApproxNDCG): use when NDCG is the evaluation metric and you have graded relevance labels; requires all candidates in memory simultaneously