Retrieval-Augmented Generation

What it is

Retrieval-Augmented Generation (RAG) combines a retrieval system (dense or sparse) with a language model. Given a query, relevant documents are retrieved and provided as context to the model, which then generates answers grounded in the retrieved information. RAG reduces hallucination, enables use of private/proprietary knowledge, and simplifies knowledge updates.

[illustrate: Query → retrieval system → top-k documents → concatenate with query → LLM → grounded answer]

How it works

  1. Offline indexing:

    • Index documents with a retriever (dense, sparse, or hybrid)
    • Store document passages or full text
  2. Retrieval:

    • Encode the query with the same retriever
    • Retrieve top-k relevant documents
  3. Generation:

    • Format prompt: “Context: [retrieved docs]\n\nQuestion: [query]\nAnswer:”
    • Feed to a generative LLM (GPT, T5, etc.)
    • Model generates answer conditioned on retrieved context
  4. Output: Grounded, more factual answer (code sketches of these steps appear below)
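
A minimal sketch of the indexing and retrieval steps in Python follows. It is illustrative only, not a production setup: embed() is a toy hashed bag-of-words stand-in for a real dense encoder (e.g., a sentence-embedding model), and the in-memory NumPy matrix stands in for a vector store.

import re
import zlib
import numpy as np

DIM = 1024  # size of the toy embedding space

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a dense encoder: a hashed bag-of-words vector,
    # used here only so the sketch runs with no extra dependencies.
    vec = np.zeros(DIM)
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

# 1. Offline indexing: embed each passage and store the vectors
passages = [
    "France is a country in Europe. Paris is its capital.",
    "Paris is known for the Eiffel Tower.",
    "Mount Everest is the highest mountain on Earth.",
]
index = np.stack([embed(p) for p in passages])      # shape: (num_passages, DIM)

# 2. Retrieval: encode the query with the same encoder, rank by cosine similarity
def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)                   # unit-norm vectors, so dot = cosine
    top_k = np.argsort(-scores)[:k]
    return [passages[i] for i in top_k]

print(retrieve("What is the capital of France?"))
# With a real encoder, the France/Paris passages rank highest here.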

Example

Query: "What is the capital of France?"

Retrieval: Search the index for France-related docs
Retrieved docs:
  [1] "France is a country in Europe. Paris is its capital..."
  [2] "Paris is known for the Eiffel Tower..."

Generation prompt:
"Context: France is a country in Europe. Paris is its capital..."
"Question: What is the capital of France?"
"Answer: "

Model outputs: "Paris" (grounded in the retrieved docs; a sketch of this generation step appears below)
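
A sketch of the generation step for this example, assembling the prompt shown above. llm_complete() is a hypothetical placeholder for whatever LLM client is actually used (a hosted API or a local model); it is mocked here so the sketch runs end to end.

# Passages returned by the retrieval step for the example query
retrieved_docs = [
    "France is a country in Europe. Paris is its capital...",
    "Paris is known for the Eiffel Tower...",
]
query = "What is the capital of France?"

# 3. Generation: assemble the grounded prompt exactly as shown above
prompt = (
    "Context: " + "\n".join(retrieved_docs)
    + "\n\nQuestion: " + query
    + "\nAnswer:"
)

def llm_complete(prompt: str) -> str:
    # Hypothetical placeholder for a real LLM call (hosted API or local model);
    # mocked here so the example runs end to end.
    return "Paris"

# 4. Output: an answer grounded in the retrieved passages
print(llm_complete(prompt))    # "Paris"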

Variants and history

RAG was introduced around 2020 by Lewis et al. at Facebook AI Research (FAIR), combining a DPR dense retriever with a BART generator. Fusion-in-Decoder encodes each retrieved document separately and fuses them in the decoder. Recursive retrieval iteratively retrieves based on model-generated intermediate answers. Long-context models (100k+ tokens) reduce the need for retrieval, but RAG remains useful for keeping knowledge separate from the model and easy to update. It is the industry standard for production QA and chatbot systems.

When to use it

Use RAG when:

  • Factuality and grounding are critical
  • Knowledge base changes frequently
  • Hallucination tolerance is low
  • You want interpretable reasoning (see what was retrieved)
  • Domain-specific knowledge is not in pre-training
  • Reducing model size is beneficial (a smaller model paired with a large retrieval index)

RAG adds retrieval latency (~50–200ms for dense retrieval) but dramatically improves reliability. Trade-off: slightly slower than direct LLM generation but far more trustworthy.

See also