HyDE

What it is

HyDE (Hypothetical Document Embeddings; Gao et al., 2022) improves zero-shot dense retrieval by changing what is embedded on the query side. Instead of embedding the short query directly, an LLM generates a hypothetical document that answers the query, and that document is embedded in the query's place. Even when it is factually incorrect, the hypothetical document tends to share vocabulary, style, and structure with real relevant documents, which makes its embedding a better search vector than the embedding of the query alone.

[illustrate: Query → LLM generates hypothetical answer → embed hypothetical → ANN search in document space; hypothetical document closer to relevant docs than short query]

How it works

  1. Hypothetical document generation:

    • Prompt an LLM: “Write a passage that answers: {query}”
    • Generate one or more hypothetical documents (typically 8)
  2. Embedding:

    • Encode the hypothetical document(s) with a dense retrieval encoder
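    • Average the embeddings of multiple hypotheticals to reduce noise (the original paper also includes the embedding of the query itself in this average)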
  3. Retrieval:

    • ANN search using the hypothetical document embedding as the query vector
    • Retrieve top-k real documents
  4. Why it works:

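    • Short queries give the encoder little to work with; a full passage carries more context and lands closer to where relevant documents sit in embedding space
    • Encoders such as Contriever (used in the original paper) are trained without labels on document-document similarity; HyDE moves the query side into that “document distribution”
    • Factual errors in the hypothetical don’t matter much, because topical vocabulary is what drives retrieval and the fixed-size embedding discards much of the incorrect detail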
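
The three steps above (generate, embed and average, search) can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' implementation: toy_embed is a hashed bag-of-words stand-in for a real dense encoder, fake_llm is a hypothetical placeholder for an LLM call, and the brute-force cosine scan stands in for an ANN index. The function names, the embedding size, and the three-document corpus are all invented for the example.

```python
import hashlib
import math
import re

DIM = 1024  # size of the toy embedding space (arbitrary choice for this sketch)


def toy_embed(text):
    """Hashed bag-of-words vector, L2-normalised. Stand-in for a dense encoder."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        # md5 keeps bucket assignment stable across runs (built-in hash() is salted)
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hyde_search(query, generate, corpus, k=3, n_hypotheticals=8):
    """Generate hypotheticals, average their embeddings, rank real documents."""
    # Step 1: hypothetical document generation
    prompt = f"Write a passage that answers: {query}"
    hypotheticals = [generate(prompt) for _ in range(n_hypotheticals)]
    # Step 2: embed each hypothetical and average to reduce noise
    search_vec = mean_vector([toy_embed(h) for h in hypotheticals])
    # Step 3: brute-force nearest-neighbour search (a real system would use ANN)
    scored = [(cosine(search_vec, toy_embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def fake_llm(prompt):
    """Hypothetical LLM stub: always returns the same canned passage."""
    return ("Aurora borealis occurs when charged particles from the sun interact "
            "with Earth's magnetic field, exciting atmospheric gases like oxygen "
            "and nitrogen to emit colored light.")


corpus = [
    "Charged particles from the solar wind excite oxygen and nitrogen in the "
    "upper atmosphere, producing light.",
    "Sourdough bread needs a starter, flour, water, and a long fermentation.",
    "The stock market closed higher after the central bank announcement.",
]

results = hyde_search("What causes aurora borealis?", fake_llm, corpus, k=1)
print(results[0][1])
```

In a real pipeline, generate would sample from an LLM with non-zero temperature so that the hypotheticals differ, which is what makes averaging worthwhile. With the deterministic stub here, all eight hypotheticals are identical and the average equals a single embedding.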

Example

Query: "What causes aurora borealis?"

LLM generates:
  "Aurora borealis occurs when charged particles from the sun
   interact with Earth's magnetic field, exciting atmospheric
   gases like oxygen and nitrogen to emit colored light..."

Embed this generated passage → retrieve real documents about
  solar wind, magnetosphere, atmospheric physics
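
The vocabulary effect in this example can be checked with a crude lexical proxy. The sketch below compares token overlap (Jaccard similarity) between a relevant passage and either the raw query or the generated passage. It rests on two assumptions: real_doc is an invented passage rather than a retrieved one, and token overlap is only a rough stand-in for what a dense encoder measures, since real encoders also capture paraphrase and topic.

```python
import re


def tokens(text):
    """Lower-cased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of distinct tokens the two texts have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


query = "What causes aurora borealis?"

hypothetical = (
    "Aurora borealis occurs when charged particles from the sun interact "
    "with Earth's magnetic field, exciting atmospheric gases like oxygen "
    "and nitrogen to emit colored light."
)

# Invented passage standing in for a real relevant document
real_doc = (
    "The solar wind carries charged particles that Earth's magnetic field "
    "guides toward the poles, where they excite oxygen and nitrogen."
)

q_score = jaccard(query, real_doc)
h_score = jaccard(hypothetical, real_doc)
print(f"query vs doc:        {q_score:.3f}")
print(f"hypothetical vs doc: {h_score:.3f}")
```

The raw query shares no tokens with this passage, while the generated passage shares several ("charged", "particles", "magnetic", "field", "oxygen", "nitrogen"). This is the gap that HyDE exploits.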

Variants and history

HyDE (2022) reported strong zero-shot results on web search (TREC DL19 and DL20), low-resource BEIR tasks, and the multilingual Mr.TyDi benchmark, without training the retriever on relevance labels. Query2Doc (2023) applied a similar idea but concatenated the generated text with the original query rather than replacing it. Step-Back Prompting (2023) is a related query-rewriting technique in which the LLM first produces a more abstract version of the question. HyDE is widely used in RAG pipelines as a query preprocessing step.

When to use it

Use HyDE when:

  • No labeled retrieval training data is available
  • Zero-shot retrieval performance needs improvement
  • An LLM is already available in the pipeline
  • The query is short or ambiguous (HyDE helps most in these cases)

Avoid HyDE when the LLM is likely to generate hallucinated entities that mislead the embedding model (for example, on topics it knows little about), or when the latency added by LLM generation is unacceptable.

See also