FiD (Fusion-in-Decoder)

What it is

FiD (Fusion-in-Decoder, Izacard & Grave, 2020) is a reader architecture for open-domain QA that encodes each retrieved passage independently with a T5 encoder, then concatenates the encoded representations so that the decoder can attend over all of them at once. Because encoder self-attention never crosses passage boundaries, its cost grows linearly with the number of passages instead of quadratically, as it would if all passages were processed together as a single long sequence. This makes it practical to use many more retrieved passages (up to 100).

[illustrate: k retrieved passages → k independent T5 encoder calls → concatenated encoder outputs → single T5 decoder generates answer; attention spans all passages]
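
The cost argument can be made concrete with a small back-of-the-envelope calculation. The sketch below counts only the number of token pairs scored by encoder self-attention in one layer, which is a rough proxy that ignores attention heads, hidden size, and feed-forward layers. The function names and the passage length L = 250 are illustrative assumptions, not values taken from this entry.

```python
def self_attention_pairs_concat(k: int, L: int) -> int:
    """Token pairs scored per layer when all k passages form one sequence."""
    n = k * L
    return n * n


def self_attention_pairs_fid(k: int, L: int) -> int:
    """Token pairs scored per layer when each passage is encoded on its own."""
    return k * L * L


if __name__ == "__main__":
    L = 250  # assumed tokens per (query + passage) input, for illustration
    for k in (10, 100):
        concat = self_attention_pairs_concat(k, L)
        fid = self_attention_pairs_fid(k, L)
        print(f"k={k}: concat={concat:,} fid={fid:,} ratio={concat // fid}x")
```

Under this proxy the ratio between the two is exactly k, so the gap widens as more passages are retrieved.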

How it works

  1. Independent encoding:

    • For each of k retrieved passages: prepend query, encode with T5 encoder
    • Each passage produces a sequence of hidden states [L × d] (L tokens, hidden size d)
    • k passages → k × L hidden states in total
  2. Fusion in decoder:

    • Concatenate all k × L hidden states
    • T5 decoder attends over the full concatenated representation
    • Cross-attention spans all passages simultaneously
  3. Generation:

    • Decoder generates the answer autoregressively
    • Naturally aggregates information across passages
  4. Scaling advantage:

    • Encoder self-attention costs O(k × L²): linear in k (k independent passes, can be parallelized)
    • vs. concatenation: self-attention over one sequence of k × L tokens costs O((k × L)²), quadratic in k
    • Enables k = 100 passages vs. k ≈ 5–10 for concatenation approaches
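
The data flow in steps 1 and 2 can be sketched in a few lines of plain Python. This is a toy illustration only: toy_encode is a hypothetical stand-in for the T5 encoder (it produces deterministic dummy vectors, not learned representations), the "question: ... context: ..." input template is an assumed format, and cross_attend computes the attention weights of a single decoder state rather than a full decoder layer.

```python
import math
from typing import List

Vector = List[float]


def toy_encode(query: str, passage: str, d: int = 4) -> List[Vector]:
    """Hypothetical stand-in for the T5 encoder: one d-dim vector per token.

    A real encoder would run self-attention within this single
    (query + passage) input only, never across passages.
    """
    tokens = f"question: {query} context: {passage}".split()
    return [[math.sin((i + 1) * (j + 1) + len(tok)) for j in range(d)]
            for i, tok in enumerate(tokens)]


def fuse(encoded: List[List[Vector]]) -> List[Vector]:
    """Concatenate per-passage hidden states along the sequence axis."""
    return [state for passage_states in encoded for state in passage_states]


def cross_attend(decoder_state: Vector, memory: List[Vector]) -> List[float]:
    """Softmax attention weights of one decoder state over all fused states."""
    d = len(decoder_state)
    scores = [sum(q * m for q, m in zip(decoder_state, vec)) / math.sqrt(d)
              for vec in memory]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


query = "who wrote hamlet"
passages = [
    "Hamlet is a tragedy by William Shakespeare",
    "The play is set in Denmark",
    "Shakespeare was born in Stratford-upon-Avon",
]

encoded = [toy_encode(query, p) for p in passages]  # step 1: k independent calls
memory = fuse(encoded)                              # step 2: one fused sequence
weights = cross_attend([1.0, 0.0, 0.0, 0.0], memory)

print(len(encoded), [len(e) for e in encoded], len(memory))
```

The point of the sketch is the shapes: the encoder is called once per passage, while the decoder sees a single memory whose length is the sum of the per-passage lengths, so one softmax distributes attention across every passage.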

Variants and history

FiD (2020) became a widely used reader architecture for open-domain QA, substantially outperforming both single-passage readers and full-concatenation approaches. FiD-KD (Izacard & Grave, 2020) trained the retriever by distilling the reader's cross-attention scores into it, removing the need for annotated query-passage pairs. Atlas (Izacard et al., 2022) trained a FiD reader jointly with a dense retriever and showed strong few-shot performance. The independent encoding pattern is now common in RAG systems where context window limits require chunked processing.

When to use it

Use FiD when:

  • Open-domain QA requires synthesizing information across many passages
  • Context window constraints prevent concatenating all retrieved passages
  • A generative answer is required (not just passage ranking)
  • You have a T5-family model and can parallelize encoder passes

See also