LLM Rerankers (RankGPT)

What it is

LLM rerankers such as RankGPT (Sun et al., 2023) use a large language model, for example GPT-4 or an open-source alternative, to rerank retrieval candidates in a zero-shot setting. The LLM receives the query and a numbered list of candidate passages and is prompted to output the passage numbers in order of relevance. No fine-tuning or relevance labels are required: the LLM’s general language understanding provides the ranking signal.

[illustrate: Query + numbered passages → LLM prompt → ordered list of passage IDs as output]

How it works

  1. Prompt format (listwise):

    I will provide you with {k} passages, each indicated by a number.
    Rank the passages based on their relevance to the query: {query}
    
    [1] {passage_1}
    [2] {passage_2}
    ...
    [k] {passage_k}
    
    The passages should be ranked from most to least relevant.
    Output: a permutation of [1] through [{k}].
    
  2. Sliding window for large candidate sets:

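    • Context length limits each window to about 20 passages
    • Slide the window from the bottom to the top of the initial ranking with 50% overlap (for example, a step of 10 for a window of 20)
    • Relevant passages “bubble up” toward the top as each overlapping window carries its best passages into the next one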
  3. Ranking extraction:

    • Parse the output permutation
    • Map back to original passage IDs
    • Handle malformed outputs (missing, duplicate, or out-of-range numbers) by falling back to the original order
  4. Pointwise variant:

    • Query each passage independently: “Is this passage relevant to the query? Yes/No”
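    • Use log P(“Yes”) as the relevance score, which requires access to the model’s token log-probabilities
    • More robust to long inputs, since each prompt holds a single passage, but the model never compares passages against each other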
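
Steps 1 to 3 above can be sketched in a few functions. This is a minimal illustration, not RankGPT's reference implementation: the names build_prompt, parse_permutation, sliding_window_rerank, and toy_llm are hypothetical, and toy_llm is a keyword-counting stand-in for a real model call, which would take the prompt string and return the model's text reply.

```python
import re
from typing import Callable, List

def build_prompt(query: str, passages: List[str]) -> str:
    """Fill the listwise template with numbered passages (1-based)."""
    k = len(passages)
    body = "".join(f"[{i}] {p}\n" for i, p in enumerate(passages, start=1))
    return (
        f"I will provide you with {k} passages, each indicated by a number.\n"
        f"Rank the passages based on their relevance to the query: {query}\n\n"
        f"{body}\n"
        "The passages should be ranked from most to least relevant.\n"
        f"Output: a permutation of [1] through [{k}]."
    )

def parse_permutation(reply: str, k: int) -> List[int]:
    """Turn a reply like '[2] > [1] > [3]' into 0-based indices.

    Duplicates and out-of-range numbers are dropped; anything the model
    omitted is appended in its original order, so a fully malformed reply
    falls back to the original ranking.
    """
    seen: List[int] = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1
        if 0 <= idx < k and idx not in seen:
            seen.append(idx)
    return seen + [i for i in range(k) if i not in seen]

def sliding_window_rerank(
    query: str,
    passages: List[str],
    llm: Callable[[str], str],
    window: int = 20,
    step: int = 10,
) -> List[int]:
    """Return passage indices, best first, using a bottom-to-top window."""
    if not 0 < step <= window:
        raise ValueError("step must satisfy 0 < step <= window")
    order = list(range(len(passages)))
    if not order:
        return order
    end = len(order)
    while True:
        start = max(0, end - window)
        chunk = order[start:end]
        reply = llm(build_prompt(query, [passages[i] for i in chunk]))
        perm = parse_permutation(reply, len(chunk))
        order[start:end] = [chunk[j] for j in perm]
        if start == 0:
            return order
        end -= step  # overlap of (window - step) passages with the next window

def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: ranks by count of the word 'rerank'."""
    items = re.findall(r"^\[(\d+)\] (.*)$", prompt, flags=re.MULTILINE)
    ranked = sorted(items, key=lambda it: -it[1].lower().count("rerank"))
    return " > ".join(f"[{num}]" for num, _ in ranked)

docs = ["filler text"] * 25
docs[24] = "rerank rerank rerank"
docs[17] = "rerank rerank"
docs[3] = "rerank"
print(sliding_window_rerank("how to rerank", docs, toy_llm, window=10, step=5)[:3])
```

With the toy scorer, the three passages that mention the keyword start at positions 25, 18, and 4 and end up at the top, which shows how the overlap carries strong candidates from one window into the next. A single bottom-to-top pass is only guaranteed to fix the head of the list; the tail stays in whatever order the windows left it.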

Variants and history

RankGPT (2023) showed that zero-shot reranking with GPT-4 matches or exceeds the fine-tuned MonoT5-3B on TREC DL and BEIR. RankVicuna and RankZephyr brought listwise reranking to open-source 7B models by distilling rankings produced by GPT models into them. PRP (Pairwise Ranking Prompting) compares passages two at a time, with all-pairs comparison as one of its variants. LRL (Listwise Reranker with a Large Language Model) studied prompt sensitivity. LLM rerankers are now commonly used as the second or third stage in production pipelines.

When to use it

Use LLM rerankers when:

  • No labeled reranking data is available for fine-tuning
  • Highest possible reranking quality is needed and latency allows LLM inference
  • You already have an LLM API in your infrastructure
  • You want to test reranking quality before committing to a fine-tuned model

Not suitable when latency is critical (LLM inference is slow), cost is constrained (every query triggers one or more API calls), or the candidate set is very large and no earlier stage filters it.
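
To make the cost point concrete, the number of LLM calls per query can be estimated for each prompting style. The sketch below is back-of-the-envelope arithmetic with hypothetical function names; it counts calls only and ignores prompt length, which also drives latency and price. The all-pairs figure assumes one call per unordered pair, so an implementation that queries both orderings would double it.

```python
import math

def sliding_window_calls(n: int, window: int = 20, step: int = 10) -> int:
    """Calls for one bottom-to-top listwise pass over n candidates."""
    if not 0 < step <= window:
        raise ValueError("step must satisfy 0 < step <= window")
    if n <= 0:
        return 0
    if n <= window:
        return 1
    # One call for the last window, plus one per step needed to reach the top.
    return 1 + math.ceil((n - window) / step)

def pointwise_calls(n: int) -> int:
    """One Yes/No call per candidate passage."""
    return max(n, 0)

def all_pairs_calls(n: int) -> int:
    """One call per unordered pair of candidates (assumption, see lead-in)."""
    return n * (n - 1) // 2 if n > 1 else 0

for n in (20, 100, 1000):
    print(n, sliding_window_calls(n), pointwise_calls(n), all_pairs_calls(n))
```

For 100 candidates with a window of 20 and a step of 10, this gives 9 listwise calls, 100 pointwise calls, and 4950 all-pairs calls, which is why a cheap first-stage filter usually precedes the LLM.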

See also