Sparse Retrieval
What it is
Sparse retrieval matches queries against documents using an inverted index that maps terms to their occurrences. Documents are ranked using statistical scoring functions (BM25, TF-IDF) based on term frequencies and collection statistics. Sparse methods are interpretable, fast, and effective for keyword-heavy queries.
[illustrate: Inverted index structure with term → postings lists; BM25 scoring formula; query matching flow]
How it works
Index construction:
- Tokenize and normalize documents
- Build inverted index: term → [doc_id, positions, frequencies]
- Pre-compute document statistics (length, term counts)
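The three construction steps can be sketched in Python. This is a minimal sketch, not a production indexer: the regex tokenizer and the one-document corpus are illustrative assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Normalize: lowercase, then split into alphanumeric tokens
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Inverted index: term -> {doc_id: [positions]}
    index = defaultdict(dict)
    # Pre-computed document statistics (lengths) needed later for scoring
    doc_lens = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_lens[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index, doc_lens

index, doc_lens = build_index({"doc_1": "Machine learning algorithms learn from data."})
# index["learning"] -> {"doc_1": [1]}; doc_lens["doc_1"] -> 6
```

Term frequencies fall out of the positions lists (`len(positions)`), so they need not be stored separately in this sketch.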
Query processing:
- Tokenize and normalize query
- Look up each query term in the inverted index
- Retrieve posting lists and merge to find candidate documents
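The lookup-and-merge step can be sketched as follows, assuming the `term -> {doc_id: positions}` index shape from the construction step; the small hand-built index is illustrative.

```python
import re

def tokenize(text):
    # Same normalization as at indexing time
    return re.findall(r"[a-z0-9]+", text.lower())

def candidates(index, query, mode="and"):
    # One postings set (doc ids) per query term; unknown terms yield empty sets
    postings = [set(index.get(t, {})) for t in tokenize(query)]
    if not postings or (mode == "and" and not all(postings)):
        return set()
    # AND-merge keeps docs containing every term; OR-merge, any term
    return set.intersection(*postings) if mode == "and" else set.union(*postings)

index = {"machine":  {"doc_1": [0], "doc_5": [2]},
         "learning": {"doc_1": [1], "doc_3": [0]}}
candidates(index, "machine learning")              # -> {"doc_1"}
candidates(index, "machine learning", mode="or")   # -> {"doc_1", "doc_3", "doc_5"}
```

Real engines merge sorted posting lists with galloping or skip-pointer intersection rather than Python sets, but the semantics are the same.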
Scoring:
- TF-IDF: score = Σ_term tf(term, doc) × idf(term)
- BM25: score = Σ_term idf(term) × (tf(term, doc) × (k1 + 1)) / (tf(term, doc) + k1 × (1 - b + b × (|doc| / avgdoclen)))
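The BM25 formula translates almost line-for-line into code. A sketch, assuming an index of shape `term -> {doc_id: term frequency}`; `k1 = 1.5` and `b = 0.75` are common defaults, and the smoothed idf variant used here is one of several in circulation.

```python
import math

def bm25_score(doc_id, query_terms, index, doc_lens, k1=1.5, b=0.75):
    # index: term -> {doc_id: tf}; doc_lens: doc_id -> document length
    n = len(doc_lens)
    avgdoclen = sum(doc_lens.values()) / n
    score = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        tf = postings.get(doc_id, 0)
        if tf == 0:
            continue  # term absent from this document contributes nothing
        df = len(postings)                                # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # smoothed BM25 idf
        # Saturating tf component with length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lens[doc_id] / avgdoclen))
    return score
```

`k1` controls how quickly repeated occurrences of a term saturate; `b` controls how strongly long documents are penalized (`b = 0` disables length normalization).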
Example
Query: "machine learning algorithms"
Index lookup:
"machine" → [doc_1, doc_5, doc_12, ...]
"learning" → [doc_1, doc_3, doc_7, ...]
"algorithms" → [doc_1, doc_2, doc_8, ...]
Candidate: doc_1 (matches all three terms)
BM25(doc_1) = idf("machine") × ... + idf("learning") × ... + idf("algorithms") × ...
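The whole example can be run end to end on a toy corpus (the three documents below are invented for illustration; `doc_1` is the one matching all three query terms).

```python
import math
import re
from collections import defaultdict

docs = {
    "doc_1": "machine learning algorithms for classification",
    "doc_2": "sorting algorithms and data structures",
    "doc_3": "deep learning with neural networks",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Build the inverted index (term -> {doc_id: tf}) and document lengths
index, doc_lens = defaultdict(dict), {}
for doc_id, text in docs.items():
    tokens = tokenize(text)
    doc_lens[doc_id] = len(tokens)
    for t in tokens:
        index[t][doc_id] = index[t].get(doc_id, 0) + 1

def bm25(doc_id, terms, k1=1.5, b=0.75):
    n, avgdl = len(docs), sum(doc_lens.values()) / len(docs)
    score = 0.0
    for term in terms:
        tf = index.get(term, {}).get(doc_id, 0)
        if tf == 0:
            continue
        df = len(index[term])
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lens[doc_id] / avgdl))
    return score

query = tokenize("machine learning algorithms")
ranking = sorted(docs, key=lambda d: bm25(d, query), reverse=True)
# doc_1 matches all three query terms and ranks first
```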
Variants and history
Inverted indexes are the foundation of modern IR, dating to the 1960s–70s. TF-IDF emerged as a standard scoring function in the 1980s. Okapi BM25 (Robertson et al., 1994) became the industry-standard baseline and has proven difficult to beat. Variants include BM25L (length-normalization tuning), BM25+ (length-penalty improvement), and field-weighted scoring (BM25F). Modern sparse methods include SPLADE (learned sparse embeddings), which combines the interpretability of term matching with neural term weighting.
When to use it
Choose sparse retrieval when:
- Exact keyword matching is important
- Query intent is unambiguous
- You need maximum speed and minimal resources
- Interpretability is critical (human-readable term matches)
- Working with structured or domain-specific language
- Combined with dense methods in hybrid search
Sparse methods struggle with paraphrases, synonyms, and semantic mismatch but excel at exact retrieval, scalability, and transparency.