Sparse Retrieval

What it is

Sparse retrieval matches queries against documents using an inverted index that maps terms to their occurrences. Documents are ranked using statistical scoring functions (BM25, TF-IDF) based on term frequencies and collection statistics. Sparse methods are interpretable, fast, and effective for keyword-heavy queries.

[illustrate: Inverted index structure with term → postings lists; BM25 scoring formula; query matching flow]

How it works

  1. Index construction:

    • Tokenize and normalize documents
    • Build inverted index: term → [doc_id, positions, frequencies]
    • Pre-compute document statistics (length, term counts)
  2. Query processing:

    • Tokenize and normalize query
    • Look up each query term in the inverted index
    • Retrieve posting lists and merge to find candidate documents
  3. Scoring:

    • TF-IDF: score = Σ_term tf(term, doc) × idf(term)
    • BM25: score = Σ_term idf(term) × (tf(term, doc) × (k1 + 1)) / (tf(term, doc) + k1 × (1 - b + b × (|doc| / avgdoclen)))
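The three steps above can be sketched end-to-end in Python. This is a toy implementation with a hand-rolled whitespace tokenizer; the parameters k1 = 1.5 and b = 0.75 are common defaults, and the +0.5-smoothed idf is one conventional choice, not the only one:

```python
import math
from collections import Counter, defaultdict

K1, B = 1.5, 0.75  # common BM25 defaults

def tokenize(text):
    # Minimal normalization: lowercase + whitespace split
    return text.lower().split()

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, plus doc lengths."""
    index, lengths = defaultdict(dict), {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths

def bm25_search(query, index, lengths):
    """Score every document containing at least one query term."""
    n = len(lengths)
    avgdl = sum(lengths.values()) / n
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        # idf with the usual +0.5 smoothing to avoid negative values
        idf = math.log((n - len(postings) + 0.5) / (len(postings) + 0.5) + 1)
        for doc_id, tf in postings.items():
            norm = tf + K1 * (1 - B + B * lengths[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (K1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Example usage, on a three-document corpus:

```python
docs = {"d1": "machine learning algorithms for text",
        "d2": "cooking recipes and kitchen tips",
        "d3": "deep learning machine translation"}
index, lengths = build_index(docs)
bm25_search("machine learning algorithms", index, lengths)
# "d1" ranks first: it is the only document matching all three terms
```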

Example

Query: "machine learning algorithms"
Index lookup:
  "machine" → [doc_1, doc_5, doc_12, ...]
  "learning" → [doc_1, doc_3, doc_7, ...]
  "algorithms" → [doc_1, doc_2, doc_8, ...]

Candidate: doc_1 (matches all three terms)
BM25(doc_1) = idf("machine") × ... + idf("learning") × ... + idf("algorithms") × ...
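The lookup-and-merge step of this example can be shown with plain posting lists. A conjunctive (AND) merge is a set intersection; the doc IDs here simply mirror the illustrative ones above:

```python
# Toy posting lists mirroring the example above (doc IDs are illustrative)
postings = {
    "machine":    [1, 5, 12],
    "learning":   [1, 3, 7],
    "algorithms": [1, 2, 8],
}

def candidates(query_terms, postings):
    """Documents matching every query term (conjunctive AND merge)."""
    lists = [set(postings.get(t, [])) for t in query_terms]
    return set.intersection(*lists) if lists else set()

candidates(["machine", "learning", "algorithms"], postings)  # {1}: only doc_1 matches all terms
```

Production systems typically merge sorted posting lists with a linear scan or skip pointers rather than materializing sets, but the result is the same candidate set.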

Variants and history

Inverted indexes are the foundation of modern IR, dating to the 1960s–70s. TF-IDF emerged as a standard scoring function in the 1980s. Okapi BM25 (Robertson et al., 1994) became the industry benchmark and has proven difficult to beat. Variants include BM25L (improved length normalization), BM25+ (a fix to the length penalty for long documents), and BM25F (field-weighted scoring). Modern learned sparse methods such as SPLADE combine the interpretability of sparse representations with neural term weighting.

When to use it

Choose sparse retrieval when:

  • Exact keyword matching is important
  • Query intent is unambiguous
  • You need maximum speed and minimal resources
  • Interpretability is critical (human-readable term matches)
  • Working with structured or domain-specific language
  • You are combining it with dense methods in hybrid search

Sparse methods struggle with paraphrases, synonyms, and semantic mismatch but excel at exact retrieval, scalability, and transparency.

See also