REALM

What it is

REALM (Retrieval-Augmented Language Model Pre-Training, Guu et al., 2020) was the first model to jointly train a neural retriever and a language model end-to-end during pretraining. During masked language modeling, REALM retrieves relevant documents and marginalizes over them to predict masked tokens. The retriever learns to find documents that help predict masked words, which gives it a weak but plentiful supervision signal derived entirely from unlabeled text.

[illustrate: Masked sentence → retriever fetches top-k docs → reader marginalizes over docs → predicts masked token; gradient flows through retriever]

How it works

  1. Retrieval step:

    • Input: masked sentence (context)
    • Retrieve the top-k documents with a bi-encoder: the context is the query, and documents are ranked by the inner product of the query and document embeddings
    • Documents provide background knowledge for predicting masked tokens
  2. Marginalization:

    • P(masked_token | context) = Σ_z P(masked_token | context, z) · P(z | context)
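    • z = retrieved document; the sum is approximated over the top-k documents rather than the whole corpus
    • Because P(z | context) appears in the objective, the gradient of the masked-token loss reaches the retriever, so it can be trained end-to-end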
  3. Asynchronous index refresh:

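    • Document embeddings go stale as the document encoder trains, so the index is refreshed asynchronously (ANCE uses a similar scheme)
    • The full corpus is re-encoded and re-indexed every few hundred training steps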
  4. Fine-tuning for QA:

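    • The REALM-pretrained model is fine-tuned on open-domain QA
    • The reader and the query side of the retriever are updated; in the original paper the index is built once and the document encoder is kept fixed during fine-tuning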
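
The retrieval and marginalization steps above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not REALM's actual implementation: the embeddings and reader outputs are random stand-ins for trained encoders, and the names retrieve_top_k, marginal_token_probs, and retrieval_score_gradient are hypothetical. The last function evaluates the gradient of the log marginal likelihood with respect to each retrieval score, which is the signal that trains the retriever.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def retrieve_top_k(query_vec, doc_index, k):
    """Rank documents by inner product and keep the k best.

    Returns (doc_ids, p_z), where p_z approximates P(z | context) with a
    softmax restricted to the top-k retrieval scores.
    """
    scores = doc_index @ query_vec          # shape: (num_docs,)
    top_ids = np.argsort(-scores)[:k]       # highest score first
    return top_ids, softmax(scores[top_ids])

def marginal_token_probs(p_token_given_doc, p_z):
    """P(token | context) = sum_z P(token | context, z) * P(z | context)."""
    return p_z @ p_token_given_doc          # shape: (vocab_size,)

def retrieval_score_gradient(p_target_given_doc, p_z):
    """Gradient of log P(target | context) w.r.t. each top-k retrieval score.

    A document's score is pushed up when it predicts the target better
    than the marginal does, and pushed down otherwise.
    """
    p_target = p_z @ p_target_given_doc
    return (p_target_given_doc / p_target - 1.0) * p_z

# Toy stand-ins for trained components (assumptions, not real model outputs).
rng = np.random.default_rng(0)
doc_index = rng.normal(size=(6, 4))         # 6 documents, 4-dim embeddings
query_vec = rng.normal(size=4)              # embedding of the masked context

doc_ids, p_z = retrieve_top_k(query_vec, doc_index, k=3)

# Hypothetical reader: one distribution over a 5-word vocabulary per document.
reader_logits = rng.normal(size=(3, 5))
p_token_given_doc = np.vstack([softmax(row) for row in reader_logits])

p_token = marginal_token_probs(p_token_given_doc, p_z)

target_id = 2                               # index of the true masked token
grad = retrieval_score_gradient(p_token_given_doc[:, target_id], p_z)
```

In this sketch p_token is a proper distribution over the vocabulary, and the entries of grad sum to zero: probability mass is only shifted between retrieved documents, toward those that make the correct token more likely.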

Variants and history

REALM (2020, Google) was a landmark paper demonstrating that retrieval can be incorporated into pretraining. It influenced later retrieval-augmented models such as RAG (Lewis et al., 2020) and Atlas (2022). Joint training is expensive, but REALM’s results suggested that a retriever trained end-to-end with the language model performs better than one trained independently. Most production RAG systems nonetheless use independently trained retrievers for simplicity.

When to use it

REALM is primarily a research reference. For production:

  • Use DPR or Contriever as the retrieval component
  • Use the standard RAG or FiD (Fusion-in-Decoder) architecture for the reader
  • REALM’s joint training is too expensive for most teams without dedicated pretraining infrastructure

See also