HyDE
What it is
HyDE (Hypothetical Document Embeddings; Gao et al., 2022) improves zero-shot dense retrieval by changing how the query is represented. Instead of embedding the short query directly, an LLM generates a hypothetical document that would answer the query, and that document is embedded in the query's place. Even when it is factually incorrect, the hypothetical document tends to share vocabulary, style, and structure with real relevant documents, which makes it a better query representation for embedding-based retrieval.
[illustrate: Query → LLM generates hypothetical answer → embed hypothetical → ANN search in document space; hypothetical document closer to relevant docs than short query]
How it works
- Hypothetical document generation:
  - Prompt an LLM: “Write a passage that answers: {query}”
  - Generate one or more hypothetical documents (typically 8)
- Embedding:
  - Encode each hypothetical document with a dense retrieval encoder
  - If several were generated, average their embeddings to reduce noise from any single generation
- Retrieval:
  - Run ANN search with the (averaged) hypothetical document embedding as the query vector
  - Retrieve the top-k real documents
- Why it works:
  - A short query gives the encoder little context; a full passage carries far more signal
  - When the encoder was trained on document-document similarity, HyDE moves the query into the same “document distribution”
  - Factual errors in the hypothetical don’t matter much, because topical vocabulary is what drives retrieval
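The generate, embed, average, and search steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: `fake_llm` and `toy_embed` are hypothetical stand-ins for a real LLM call and a real dense encoder, the three-document corpus is made up, and brute-force inner-product search takes the place of an ANN index.

```python
import re
import zlib
from typing import Callable, List, Tuple

import numpy as np

DIM = 4096  # size of the toy embedding space (arbitrary choice)


def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a dense encoder: hashed bag-of-words, L2-normalised."""
    vec = np.zeros(DIM)
    for token in re.findall(r"[a-z]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def fake_llm(prompt: str, n: int) -> List[str]:
    """Hypothetical stand-in for an LLM call; ignores the prompt and returns n canned passages."""
    canned = [
        "Aurora borealis occurs when charged particles from the sun interact "
        "with Earth's magnetic field, exciting oxygen and nitrogen to emit light.",
        "The northern lights appear when the solar wind hits the magnetosphere "
        "and energises gases in the upper atmosphere.",
    ]
    return [canned[i % len(canned)] for i in range(n)]


def hyde_query_vector(
    query: str,
    generate: Callable[[str, int], List[str]],
    embed: Callable[[str], np.ndarray],
    n: int = 8,
) -> np.ndarray:
    """Generate n hypothetical documents, embed each one, and average the embeddings."""
    hypotheticals = generate(f"Write a passage that answers: {query}", n)
    vectors = np.stack([embed(doc) for doc in hypotheticals])
    return vectors.mean(axis=0)


def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int) -> List[Tuple[int, float]]:
    """Brute-force inner-product search; a real system would query an ANN index here."""
    scores = doc_matrix @ query_vec
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Made-up corpus for illustration only.
corpus = [
    "Solar wind particles are guided by Earth's magnetic field toward the poles, "
    "where they excite oxygen and nitrogen in the upper atmosphere.",
    "Sourdough bread relies on wild yeast and a long fermentation to develop flavor.",
    "Stock markets closed higher after central banks held interest rates steady.",
]
doc_matrix = np.stack([toy_embed(doc) for doc in corpus])

query_vec = hyde_query_vector("What causes aurora borealis?", fake_llm, toy_embed)
results = top_k(query_vec, doc_matrix, k=2)
for idx, score in results:
    print(f"{score:.3f}  {corpus[idx]}")
```

Only the hypothetical documents are embedded on the query side; the corpus is encoded once with the same encoder, so swapping in a real LLM and encoder should only require replacing the two stand-in functions.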
Example
Query: "What causes aurora borealis?"
LLM generates:
"Aurora borealis occurs when charged particles from the sun
interact with Earth's magnetic field, exciting atmospheric
gases like oxygen and nitrogen to emit colored light..."
Embed this generated passage → retrieve real documents about
solar wind, magnetosphere, atmospheric physics
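To make the example concrete, the sketch below compares how close the short query and the generated passage each sit to a relevant document. It assumes a toy hashed bag-of-words embedding (`toy_embed`, a hypothetical stand-in, not a real dense encoder) and a made-up `real_doc`. A purely lexical toy exaggerates the gap, since a trained encoder would also capture some semantic similarity for the short query, so treat it as an illustration of the direction of the effect only.

```python
import re
import zlib

import numpy as np

DIM = 4096  # size of the toy embedding space (arbitrary choice)


def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a dense encoder: hashed bag-of-words counts."""
    vec = np.zeros(DIM)
    for token in re.findall(r"[a-z]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vec


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


query = "What causes aurora borealis?"
hypothetical = (
    "Aurora borealis occurs when charged particles from the sun "
    "interact with Earth's magnetic field, exciting atmospheric "
    "gases like oxygen and nitrogen to emit colored light..."
)
# Made-up passage standing in for a real corpus document.
real_doc = (
    "Solar wind particles are guided by Earth's magnetic field toward the poles, "
    "where they excite oxygen and nitrogen in the upper atmosphere."
)

sim_query = cosine(toy_embed(query), toy_embed(real_doc))
sim_hypo = cosine(toy_embed(hypothetical), toy_embed(real_doc))
print(f"short query vs. real doc:       {sim_query:.3f}")
print(f"hypothetical doc vs. real doc:  {sim_hypo:.3f}")
```

The real document never mentions "aurora" or "borealis", so the short query has almost nothing to match on, while the generated passage shares terms such as "magnetic field", "oxygen", and "nitrogen" with it.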
Variants and history
HyDE (2022) reported strong zero-shot improvements on BEIR without fine-tuning the dense retriever. Query2Doc (2023) applied a similar idea but concatenates the generated text with the original query rather than replacing it. Step-Back Prompting takes a related approach, rephrasing the question at a more abstract level before retrieval. HyDE is widely used in RAG pipelines as a query preprocessing step.
When to use it
Use HyDE when:
- No labeled retrieval training data is available
- Zero-shot retrieval performance needs improvement
- An LLM is already available in the pipeline
- The query is short or ambiguous (HyDE helps most in these cases)
Not useful when: the LLM hallucinates entities that steer the embedding toward the wrong documents, or the latency added by LLM generation is unacceptable.