LLM Rerankers (RankGPT)
What it is
LLM rerankers (RankGPT, Sun et al., 2023) use large language models such as GPT-4 or open-source alternatives to rerank retrieval candidates in a zero-shot setting. The LLM receives the query and a numbered list of candidate passages and is prompted to output the passage numbers in order of relevance. No fine-tuning or relevance labels are required: the LLM’s general language understanding provides the ranking signal.
[illustrate: Query + numbered passages → LLM prompt → ordered list of passage IDs as output]
How it works
Prompt format (listwise):
    I will provide you with {k} passages, each indicated by a number. Rank the passages based on their relevance to the query: {query}
    [1] {passage_1}
    [2] {passage_2}
    ...
    [k] {passage_k}
    The passages should be ranked from most to least relevant. Output: a permutation of [1] through [{k}].
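A minimal sketch of how the listwise prompt could be assembled in Python. The function name `build_listwise_prompt`, the `max_chars` truncation, and the `[2] > [1] > [3]` output hint are assumptions added for illustration; they are not part of the template above.

```python
def build_listwise_prompt(query: str, passages: list[str], max_chars: int = 300) -> str:
    """Format a query and its candidate passages into a listwise ranking prompt."""
    k = len(passages)
    lines = [
        f"I will provide you with {k} passages, each indicated by a number. "
        f"Rank the passages based on their relevance to the query: {query}",
        "",
    ]
    for i, passage in enumerate(passages, start=1):
        # Collapse whitespace and truncate (assumed limit) so one long passage
        # cannot crowd the others out of the context window.
        text = " ".join(passage.split())[:max_chars]
        lines.append(f"[{i}] {text}")
    lines.append("")
    lines.append(
        "The passages should be ranked from most to least relevant. "
        # The example format is an assumption; it makes the output easier to parse.
        f"Output: a permutation of [1] through [{k}], for example [2] > [1] > [3]."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_listwise_prompt("what is bm25", ["BM25 is a ranking function.", "Cats are mammals."]))
```

Numbering starts at 1 so that it matches the bracketed identifiers the model is asked to return.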
Sliding window for large candidate sets:
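- Context limits restrict each window to about 20 passages
- Slide the window from the bottom to the top of the initial ranking with 50% overlap (for example, window size 20 and step 10)
- Relevant passages “bubble up” toward the top as the window advances; a single pass mainly fixes the head of the ranking, not the full order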
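The sliding-window pass can be sketched as follows. This is an illustration under stated assumptions, not the RankGPT reference code: `rank_window` is a hypothetical callable standing in for the LLM call, and it is assumed to return a 0-based permutation of the window it receives.

```python
from typing import Callable, Sequence


def sliding_window_rerank(
    passages: Sequence[str],
    rank_window: Callable[[list[str]], list[int]],
    window: int = 20,
    step: int = 10,
) -> list[str]:
    """Rerank by sliding a fixed-size window from the bottom of the list to the top."""
    ranked = list(passages)
    if not ranked:
        return ranked
    end = len(ranked)
    while True:
        start = max(0, end - window)
        chunk = ranked[start:end]
        # rank_window is a hypothetical stand-in for the LLM; it returns
        # indices into `chunk`, most relevant first.
        order = rank_window(chunk)
        ranked[start:end] = [chunk[i] for i in order]
        if start == 0:
            break
        # Moving by `step` < `window` keeps the best passages of this window
        # inside the next one, so they can keep moving up.
        end -= step
    return ranked


if __name__ == "__main__":
    # Toy ranker: treats each passage as a number and prefers larger values.
    def toy_ranker(chunk: list[str]) -> list[int]:
        return sorted(range(len(chunk)), key=lambda i: int(chunk[i]), reverse=True)

    docs = [str(i) for i in range(50)]
    print(sliding_window_rerank(docs, toy_ranker)[:10])
```

With a window of 20 and a step of 10, a list of 100 candidates takes 9 LLM calls. Only the top of the returned list should be treated as carefully ranked.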
Score extraction:
- Parse the output permutation
- Map the ranked indices back to the original passage IDs
- Handle malformed outputs (missing, duplicated, or out-of-range indices) by falling back to the original order
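The parsing steps above might look like the sketch below. It assumes the model writes indices in square brackets (for example `[2] > [3] > [1]`). The function name `parse_permutation` is hypothetical. The repair policy is also a design assumption: invalid indices are dropped, and omitted passages are appended in their original order, so a fully unparseable output returns the original ranking.

```python
import re


def parse_permutation(output: str, passage_ids: list[str]) -> list[str]:
    """Turn an LLM ranking string into an ordered list of passage IDs."""
    k = len(passage_ids)
    seen: set[int] = set()
    order: list[int] = []
    for match in re.findall(r"\[(\d+)\]", output):
        idx = int(match) - 1  # prompt numbering is 1-based
        # Skip out-of-range and repeated indices instead of failing.
        if 0 <= idx < k and idx not in seen:
            seen.add(idx)
            order.append(idx)
    # Passages the model left out keep their original relative order.
    order.extend(i for i in range(k) if i not in seen)
    return [passage_ids[i] for i in order]


if __name__ == "__main__":
    ids = ["doc-a", "doc-b", "doc-c"]
    print(parse_permutation("[2] > [3] > [1]", ids))
    print(parse_permutation("Sorry, I cannot rank these.", ids))
```

The result is always a full permutation of the input IDs, so downstream stages never lose a candidate because of a malformed response.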
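Pointwise variant:
- Query each passage independently: “Is this passage relevant to the query? Yes/No”
- Use log P(“Yes”) as the score (this requires access to token log-probabilities)
- More robust to long context, since each prompt holds a single passage, but passages are never compared against each other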
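A sketch of pointwise scoring, assuming some way to read the log-probability of the token "Yes" from the model. The callable `yes_logprob` is hypothetical and must be backed by whatever model API is in use. `toy_yes_logprob` is a stand-in so the example runs without a model.

```python
import math
from typing import Callable


def build_pointwise_prompt(query: str, passage: str) -> str:
    """One prompt per passage; whitespace is collapsed so the passage stays on one line."""
    text = " ".join(passage.split())
    return (
        f"Passage: {text}\n"
        f"Query: {query}\n"
        "Is this passage relevant to the query? Answer Yes or No."
    )


def pointwise_rerank(
    query: str,
    passages: list[str],
    yes_logprob: Callable[[str], float],
) -> list[tuple[str, float]]:
    """Score each passage independently by log P("Yes") and sort descending."""
    scored = [(p, yes_logprob(build_pointwise_prompt(query, p))) for p in passages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_yes_logprob(prompt: str) -> float:
    # Stand-in for a real model call (assumption): pretends P("Yes") is high
    # whenever the passage line mentions "ranking".
    passage_line = prompt.splitlines()[0]
    return math.log(0.9 if "ranking" in passage_line.lower() else 0.1)


if __name__ == "__main__":
    docs = ["Cats are mammals.", "BM25 is a ranking function."]
    for passage, score in pointwise_rerank("what is bm25", docs, toy_yes_logprob):
        print(f"{score:.3f}  {passage}")
```

Because every passage is scored in isolation, the calls can be batched or run in parallel, which is not possible within a single listwise window.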
Variants and history
RankGPT (2023) showed that zero-shot reranking with GPT-4 matches or exceeds a fine-tuned MonoT5-3B on TREC DL and BEIR. RankVicuna and RankZephyr brought the listwise approach to open-source 7B models by fine-tuning them on rankings generated by GPT teacher models. PRP (Pairwise Ranking Prompting) compares passages two at a time, with all-pairs, sorting-based, and sliding-window variants. LRL (Listwise Reranker with a Large Language Model) studied prompt sensitivity. LLM rerankers are now commonly used as the second or third stage in production pipelines.
When to use it
Use LLM rerankers when:
- No labeled reranking data is available for fine-tuning
- Highest possible reranking quality is needed and latency allows LLM inference
- You already have an LLM API in your infrastructure
- You want to test reranking quality before committing to a fine-tuned model
Not suitable when: latency is critical (LLM inference is slow), cost is constrained (API calls per query), or the candidate set is very large without a preceding filtering stage.