HyDE

Hyde Hypothetical-Document Dense-Retrieval Query-Expansion Neural-Ir Needs-Review

What it is

HyDE (Hypothetical Document Embeddings, Gao et al., 2022) improves zero-shot dense retrieval by changing what gets embedded on the query side. Instead of embedding the short query directly, an LLM generates a hypothetical document that would answer the query, and that document is embedded in its place. Even when it is factually incorrect, the hypothetical document tends to share vocabulary, style, and structure with real relevant documents, which makes it a better query representation for embedding-based retrieval.

[illustrate: Query → LLM generates hypothetical answer → embed hypothetical → ANN search in document space; hypothetical document closer to relevant docs than short query]

How it works

Hypothetical document generation:
- Prompt an LLM: “Write a passage that answers: {query}”
- Generate one or more hypothetical documents (typically 8)
Embedding:
- Encode the hypothetical document(s) with a dense retrieval encoder
- When several hypotheticals are generated, average their embeddings into one query vector to reduce noise from any single generation
Retrieval:
- ANN search using the hypothetical document embedding as the query vector
- Retrieve top-k real documents
Why it works:
- Short queries give the encoder little signal to work with; a full passage carries more context and lands closer to relevant documents in embedding space
- The embedding model was trained on document-document similarity; HyDE brings the query into the “document distribution”, so retrieval becomes a document-to-document match
- Factual errors in the hypothetical don’t matter much, because topical vocabulary is what drives retrieval
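
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real dense encoder, `generate` is any caller-supplied function that wraps an LLM, and the exhaustive cosine scan stands in for an ANN index.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a dense encoder: L2-normalised bag of words."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average several embeddings into a single query vector."""
    total: Dict[str, float] = {}
    for vec in vectors:
        for tok, val in vec.items():
            total[tok] = total.get(tok, 0.0) + val
    return {tok: val / len(vectors) for tok, val in total.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(val * b.get(tok, 0.0) for tok, val in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    corpus: List[str],
    n_hypotheticals: int = 8,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Generate hypotheticals, embed and average them, then rank the corpus."""
    if n_hypotheticals < 1:
        raise ValueError("n_hypotheticals must be at least 1")
    # 1. Hypothetical document generation
    prompt = f"Write a passage that answers: {query}"
    hypotheticals = [generate(prompt) for _ in range(n_hypotheticals)]
    # 2. Embedding: encode each hypothetical and average the vectors
    query_vec = mean_vector([toy_embed(h) for h in hypotheticals])
    # 3. Retrieval: exhaustive scan here; a real system would query an ANN index
    scored = [(cosine(query_vec, toy_embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In practice `generate` would call an LLM with a nonzero sampling temperature so the hypotheticals differ, and `toy_embed` would be replaced by the same encoder that was used to index the corpus.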

Example

Query: "What causes aurora borealis?"

LLM generates:
  "Aurora borealis occurs when charged particles from the sun
   interact with Earth's magnetic field, exciting atmospheric
   gases like oxygen and nitrogen to emit colored light..."

Embed this generated passage → retrieve real documents about
  solar wind, magnetosphere, atmospheric physics
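
The effect in this example can be reproduced with a toy similarity check. The sketch below assumes a purely lexical bag-of-words `embed` function (a hypothetical stand-in, so it exaggerates the gap a real dense encoder would show) and an invented `real_doc` passage that never mentions the word "aurora".

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy L2-normalised bag-of-words vector (stand-in for a dense encoder)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(val * b.get(tok, 0.0) for tok, val in a.items())


query = "What causes aurora borealis?"
hypothetical = (
    "Aurora borealis occurs when charged particles from the sun "
    "interact with Earth's magnetic field, exciting atmospheric "
    "gases like oxygen and nitrogen to emit colored light"
)
# Invented corpus passage for illustration; it shares no words with the query.
real_doc = (
    "Solar wind particles are guided by the magnetic field toward the poles, "
    "where they excite oxygen and nitrogen in the upper atmosphere."
)

sim_query = cosine(embed(query), embed(real_doc))
sim_hypothetical = cosine(embed(hypothetical), embed(real_doc))
print(f"query vs. real doc:        {sim_query:.3f}")
print(f"hypothetical vs. real doc: {sim_hypothetical:.3f}")
```

The query shares no terms with the passage, while the generated text overlaps on words such as "particles", "magnetic", "oxygen", and "nitrogen", which is the vocabulary bridge HyDE relies on.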

Variants and history

HyDE (2022) showed strong zero-shot improvements on BEIR without fine-tuning the dense retriever on relevance labels. Query2Doc (2023) used a similar idea, but concatenates the generated text to the original query rather than replacing it. Step-Back Prompting takes a related approach, rewriting the question into a more abstract form before retrieval. HyDE is widely used in RAG pipelines as a query preprocessing step.

When to use it

Use HyDE when:

- No labeled retrieval training data is available
- Zero-shot retrieval performance needs improvement
- An LLM is already available in the pipeline
- The query is short or ambiguous (HyDE helps most in these cases)

Avoid HyDE when the LLM is likely to hallucinate entities that pull the embedding toward the wrong documents, or when the added latency of LLM generation is unacceptable.