LLM Rerankers (RankGPT)

Rankgpt Llm-Reranker Reranking Listwise-Ranking Neural-Ir Needs-Review

What it is

LLM rerankers (RankGPT, Sun et al., 2023) use large language models like GPT-4 or open-source alternatives to rerank retrieval candidates in a zero-shot setting. The LLM receives the query and a numbered list of candidate passages, and is prompted to output the indices in relevance order. No fine-tuning or relevance labels are required — the LLM’s general language understanding provides the ranking signal.

[illustrate: Query + numbered passages → LLM prompt → ordered list of passage IDs as output]

How it works

Prompt format (listwise):

I will provide you with {k} passages, each indicated by a number.
Rank the passages based on their relevance to the query: {query}

[1] {passage_1}
[2] {passage_2}
...
[k] {passage_k}

The passages should be ranked from most to least relevant.
Output: a permutation of [1] through [{k}].
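
As a minimal sketch, the template above could be filled in like this. The function name `build_listwise_prompt` is a hypothetical helper, not part of any RankGPT release, and the exact whitespace between sections is an assumption.

```python
def build_listwise_prompt(query: str, passages: list[str]) -> str:
    """Fill the listwise template with a query and numbered passages."""
    k = len(passages)
    header = (
        f"I will provide you with {k} passages, each indicated by a number.\n"
        f"Rank the passages based on their relevance to the query: {query}\n"
    )
    # Passages are numbered from 1 so the model's answer can reuse the labels.
    body = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    footer = (
        "The passages should be ranked from most to least relevant.\n"
        f"Output: a permutation of [1] through [{k}]."
    )
    return f"{header}\n{body}\n\n{footer}"


if __name__ == "__main__":
    print(build_listwise_prompt("what is bm25", ["BM25 is a ranking function.", "Cats purr."]))
```

The resulting string would be sent as the user message to whichever LLM is used; that call is left out here because it depends on the provider.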

Sliding window for large candidate sets:
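- LLM context limits restrict each window to about 20 passages
- Slide the window from the bottom to the top of the initial ranking, with 50% overlap between consecutive windows (for example, window size 20 and step 10)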
- Passages “bubble up” to their correct position across passes
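
The sliding-window pass can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: `rank_window` is a hypothetical callable standing in for the LLM call, and it is assumed to return 0-based positions within the window, ordered from most to least relevant.

```python
from typing import Callable, Sequence


def sliding_window_rerank(
    passages: Sequence[str],
    rank_window: Callable[[list[str]], list[int]],
    window: int = 20,
    step: int = 10,
) -> list[str]:
    """Rerank with overlapping windows moved from the bottom of the list to the top.

    rank_window (hypothetical) receives the passages of one window and returns
    their 0-based positions ordered from most to least relevant.
    """
    ranked = list(passages)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        chunk = ranked[start:end]
        order = rank_window(chunk)
        # Write the reordered window back in place.
        ranked[start:end] = [chunk[i] for i in order]
        if start == 0:
            break
        # step = window / 2 gives the 50% overlap between consecutive windows.
        end -= step
    return ranked
```

A single bottom-to-top pass mainly settles the top of the list: the best passages in each window are carried into the next one, while passages lower down may still be out of order.
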
Score extraction:
- Parse the output permutation
- Map back to original passage IDs
- Handle malformed outputs with fallback to original order
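
A minimal sketch of the extraction step, assuming the model answers with bracketed numbers such as `[2] > [1] > [3]`. Both function names are hypothetical. Note one addition beyond the bullets above: when the output is only partly usable, the valid numbers are kept and the missing passages are appended in their original order; a fully unusable output falls back to the original order.

```python
import re


def parse_permutation(output: str, k: int) -> list[int]:
    """Turn LLM output such as "[2] > [1] > [3]" into 0-based indices.

    Duplicates and out-of-range numbers are dropped. Indices the model
    omitted are appended in their original order, so an output with no
    usable numbers yields the original order unchanged.
    """
    seen: list[int] = []
    for token in re.findall(r"\d+", output):
        idx = int(token) - 1
        if 0 <= idx < k and idx not in seen:
            seen.append(idx)
    missing = [i for i in range(k) if i not in seen]
    return seen + missing


def apply_permutation(passage_ids: list[str], order: list[int]) -> list[str]:
    """Map window positions back to the original passage IDs."""
    return [passage_ids[i] for i in order]
```

Because `parse_permutation` always returns a full permutation of `range(k)`, the caller never loses a passage even when the model's answer is truncated or repeats itself.
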
Pointwise variant:
- Query each passage independently: “Is this passage relevant to the query? Yes/No”
- Use log P(“Yes”) as the score
- More robust to long context, but each passage is scored in isolation, so the model never compares passages against each other
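
The pointwise variant might look like the sketch below. `yes_logprob` is a hypothetical callable that sends a prompt to an LLM and returns the log probability of "Yes" as the next token; how that value is obtained depends on the provider and is not shown. The prompt layout is also an assumption.

```python
from typing import Callable


def pointwise_prompt(query: str, passage: str) -> str:
    """Build the yes/no relevance question for a single passage."""
    return (
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Is this passage relevant to the query? Yes/No\n"
        "Answer:"
    )


def pointwise_rerank(
    query: str,
    passages: list[str],
    yes_logprob: Callable[[str], float],
) -> list[int]:
    """Return passage indices sorted by log P("Yes"), highest first."""
    # One independent LLM call per passage; no passage sees any other.
    scores = [yes_logprob(pointwise_prompt(query, p)) for p in passages]
    return sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
```

Since every passage is scored on its own, the calls can be batched or run in parallel, which is one practical reason to prefer this variant when windows would otherwise be too long.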

Variants and history

RankGPT (2023) showed that zero-shot reranking with GPT-4 matches or exceeds the fine-tuned MonoT5-3B on TREC DL and BEIR. RankVicuna and RankZephyr applied the same listwise approach to open-source 7B models. PRP (Pairwise Ranking Prompting) instead asks the model to compare two passages at a time, in its simplest form over all pairs. LRL (Listwise Reranker with a Large Language Model) studied prompt sensitivity. LLM rerankers are now commonly used as the second or third stage in production pipelines.

When to use it

Use LLM rerankers when:

- No labeled reranking data is available for fine-tuning
- Highest possible reranking quality is needed and latency allows LLM inference
- You already have an LLM API in your infrastructure
- You want to test reranking quality before committing to a fine-tuned model

Not suitable when: latency is critical (LLM inference is slow), cost is constrained (API calls per query), or the candidate set is very large without a preceding filtering stage.
