Retrieval-Augmented Generation

What it is

Retrieval-Augmented Generation (RAG) combines a retrieval system (dense or sparse) with a language model. Given a query, relevant documents are retrieved and provided as context to the model, which then generates answers grounded in the retrieved information. RAG reduces hallucination, enables use of private/proprietary knowledge, and simplifies knowledge updates.

[illustrate: Query → retrieval system → top-k documents → concatenate with query → LLM → grounded answer]

How it works

  1. Offline indexing:

    • Index documents with a retriever (dense, sparse, or hybrid)
    • Store document passages or full text
  2. Retrieval:

    • Encode the query with the same retriever
    • Retrieve top-k relevant documents
  3. Generation:

    • Format prompt: “Context: [retrieved docs]\n\nQuestion: [query]\nAnswer:”
    • Feed to a generative LLM (GPT, T5, etc.)
    • Model generates answer conditioned on retrieved context
  4. Output: Grounded, more factual answer (code sketches of these steps appear below)
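
A minimal sketch of the indexing and retrieval steps in Python follows. It is illustrative only, not a production setup: embed() is a toy hashed bag-of-words stand-in for a real dense encoder (e.g., a sentence-embedding model), and the in-memory NumPy matrix stands in for a vector store.

import re
import zlib
import numpy as np

DIM = 1024  # size of the toy embedding space

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a dense encoder: a hashed bag-of-words vector,
    # used here only so the sketch runs with no extra dependencies.
    vec = np.zeros(DIM)
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

# 1. Offline indexing: embed each passage and store the vectors
passages = [
    "France is a country in Europe. Paris is its capital.",
    "Paris is known for the Eiffel Tower.",
    "Mount Everest is the highest mountain on Earth.",
]
index = np.stack([embed(p) for p in passages])      # shape: (num_passages, DIM)

# 2. Retrieval: encode the query with the same encoder, rank by cosine similarity
def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)                   # unit-norm vectors, so dot = cosine
    top_k = np.argsort(-scores)[:k]
    return [passages[i] for i in top_k]

print(retrieve("What is the capital of France?"))
# With a real encoder, the France/Paris passages rank highest here.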

Example

Query: "What is the capital of France?"

Retrieval: Search the index for France-related docs
Retrieved docs:
  [1] "France is a country in Europe. Paris is its capital..."
  [2] "Paris is known for the Eiffel Tower..."

Generation prompt:
"Context: France is a country in Europe. Paris is its capital..."
"Question: What is the capital of France?"
"Answer: "

Model outputs: "Paris" (grounded in the retrieved docs; a sketch of this generation step appears below)
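
A sketch of the generation step for this example, assembling the prompt shown above. llm_complete() is a hypothetical placeholder for whatever LLM client is actually used (a hosted API or a local model); it is mocked here so the sketch runs end to end.

# Passages returned by the retrieval step for the example query
retrieved_docs = [
    "France is a country in Europe. Paris is its capital...",
    "Paris is known for the Eiffel Tower...",
]
query = "What is the capital of France?"

# 3. Generation: assemble the grounded prompt exactly as shown above
prompt = (
    "Context: " + "\n".join(retrieved_docs)
    + "\n\nQuestion: " + query
    + "\nAnswer:"
)

def llm_complete(prompt: str) -> str:
    # Hypothetical placeholder for a real LLM call (hosted API or local model);
    # mocked here so the example runs end to end.
    return "Paris"

# 4. Output: an answer grounded in the retrieved passages
print(llm_complete(prompt))    # "Paris"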

Variants and history

RAG was introduced around 2020 by Lewis et al. at Facebook AI Research (FAIR), combining a DPR dense retriever with a BART generator. Fusion-in-Decoder encodes each retrieved document separately and fuses them in the decoder. Recursive retrieval iteratively retrieves based on model-generated intermediate answers. Long-context models (100k+ tokens) reduce the need for retrieval, but RAG remains useful for keeping knowledge separate from the model and easy to update. It is the industry standard for production QA and chatbot systems.

When to use it

Use RAG when:

  • Factuality and grounding are critical
  • Knowledge base changes frequently
  • Hallucination tolerance is low
  • You want interpretable reasoning (see what was retrieved)
  • Domain-specific knowledge is not in pre-training
  • Reducing model size is beneficial (a smaller model paired with a large retrieval index)

RAG adds retrieval latency (~50–200ms for dense retrieval) but dramatically improves reliability. Trade-off: slightly slower than direct LLM generation but far more trustworthy.

See also