Atlas

What it is

Atlas (Izacard et al., 2022) is a retrieval-augmented language model that pairs a Fusion-in-Decoder (FiD) reader with a Contriever retriever and fine-tunes the two components jointly. The key finding: with retrieval, an 11B-parameter Atlas matches GPT-3 (175B) on several knowledge-intensive benchmarks using only 64 training examples. On Natural Questions, for instance, the paper reports over 42% accuracy in that 64-example setting. The result showed that retrieval augmentation can substitute for a large share of parameter count on knowledge-intensive tasks.

[illustrate: Query → Contriever retriever → top-k passages → FiD encoder-decoder → answer; gradient flows through both retriever and reader during joint fine-tuning]
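
The flow in the illustration above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the Atlas implementation: the embeddings are hard-coded toy vectors standing in for Contriever outputs, the function names (`retrieve_top_k`, `build_reader_inputs`) are hypothetical, and the `question: ... context: ...` template is an assumed FiD-style input format.

```python
def dot(u, v):
    """Dot product, used here as the bi-encoder similarity score."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(query_vec, passage_vecs, k):
    """Return (index, score) pairs for the k highest-scoring passages."""
    scored = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_reader_inputs(question, passages):
    """Build one reader input per passage.

    In a FiD-style reader each string would be encoded independently,
    and the decoder would attend over all encoded passages at once.
    """
    return [f"question: {question} context: {p}" for p in passages]


# Toy corpus and made-up 2-d embeddings (assumptions for illustration only).
passages = [
    "Paris is the capital of France.",
    "The mitochondrion is an organelle.",
    "France is a country in Western Europe.",
]
passage_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.2]  # stands in for the encoded question

top = retrieve_top_k(query_vec, passage_vecs, k=2)
reader_inputs = build_reader_inputs(
    "What is the capital of France?",
    [passages[i] for i, _ in top],
)

for text in reader_inputs:
    print(text)
```

In a real system the sort over all passages would be replaced by an approximate nearest-neighbour index, and the reader inputs would be tokenized and passed to the encoder-decoder.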

How it works

  1. Retriever: Contriever or mContriever bi-encoder for dense passage retrieval

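  2. Reader: FiD (Fusion-in-Decoder) T5 model whose encoder processes each retrieved passage independently (together with the query) and whose decoder attends over all encoded passages to generate the answer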

  3. Joint fine-tuning:

    • Reader loss: cross-entropy on answer tokens
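    • Retriever loss: Attention Distillation, in which passages the reader attends to heavily receive higher retrieval scores
    • Both components are updated by backpropagation in the same training step; the retriever's gradient comes from its distillation loss, whose reader-derived target is held fixed

  4. Efficient joint training: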

    • Reader attention over passages provides a training signal for the retriever without explicit relevance labels
    • “Perplexity Distillation” variant: passages that most reduce the reader's perplexity on the correct answer receive higher retrieval scores
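
The retriever supervision described above can be written as a KL divergence between a reader-derived target distribution over the retrieved passages and the retriever's own distribution. The following is a minimal sketch, not the paper's code: the scores are invented, the function names are hypothetical, the temperature parameter and the direction of the KL term are assumptions, and the target is treated as a constant (no gradient flows into the reader through it).

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


def distillation_loss(retriever_scores, reader_signal, temperature=1.0):
    """Retriever loss for one query and its top-k passages.

    reader_signal holds one number per passage:
      - Attention Distillation: an aggregate of reader attention on the passage
      - Perplexity Distillation: log-likelihood of the answer given the passage
    The resulting target is held fixed; only retriever_scores would be trained.
    """
    target = softmax(reader_signal)
    predicted = softmax(retriever_scores, temperature)
    return kl_divergence(target, predicted)


# Hypothetical values for three retrieved passages.
retriever_scores = [2.0, 1.0, 0.5]            # query-passage similarities
reader_log_likelihoods = [-3.0, -0.5, -4.0]   # log p(answer | question, passage)

loss_before = distillation_loss(retriever_scores, reader_log_likelihoods)
# If the retriever ranked passages exactly as the reader signal does, the loss vanishes.
loss_aligned = distillation_loss(reader_log_likelihoods, reader_log_likelihoods)

print(loss_before, loss_aligned)
```

Here the reader finds the second passage most useful while the retriever prefers the first, so the loss is positive; minimizing it would push the retriever's ranking toward the reader's preference without any relevance labels.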

Variants and history

Atlas (2022, Meta AI) showed that a retriever paired with a comparatively small model can compete with much larger models on knowledge-intensive tasks. The result influenced the broader trend of retrieval-augmented systems as an alternative to scaling language model parameters. RA-DIT (2023) carried the idea of fine-tuning both the retriever and the language model over to instruction-tuned LLMs. The Attention Distillation training signal (teaching the retriever with reader attention), which goes back to earlier FiD work by Izacard and Grave, has been reused in several subsequent papers.

When to use it

Use Atlas as a reference architecture when:

  • You want a jointly trained retriever + reader system
  • Few-shot performance on knowledge-intensive tasks matters
  • You want to see how much retrieval augmentation can substitute for model scale

For production RAG, simpler pipelines with independently trained components (DPR + T5 or DPR + LLM) are usually preferred.

See also