Sparse Retrieval
What it is
Sparse retrieval matches queries against documents using an inverted index that maps terms to their occurrences. Documents are ranked using statistical scoring functions (BM25, TF-IDF) based on term frequencies and collection statistics. Sparse methods are interpretable, fast, and effective for keyword-heavy queries.
[illustrate: Inverted index structure with term → postings lists; BM25 scoring formula; query matching flow]
How it works
Index construction:
- Tokenize and normalize documents
- Build inverted index: term → [doc_id, positions, frequencies]
- Pre-compute document statistics (length, term counts)
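The three construction steps can be sketched in Python. This is a minimal sketch, not a production indexer: the regex tokenizer and the one-document corpus are illustrative assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Normalize: lowercase, then split into alphanumeric tokens
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Inverted index: term -> {doc_id: [positions]}
    index = defaultdict(dict)
    # Pre-computed document statistics (lengths) needed later for scoring
    doc_lens = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_lens[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index, doc_lens

index, doc_lens = build_index({"doc_1": "Machine learning algorithms learn from data."})
# index["learning"] -> {"doc_1": [1]}; doc_lens["doc_1"] -> 6
```

Term frequencies fall out of the positions lists (`len(positions)`), so they need not be stored separately in this sketch.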
Query processing:
- Tokenize and normalize query
- Look up each query term in the inverted index
- Retrieve posting lists and merge to find candidate documents
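The lookup-and-merge step can be sketched as follows, assuming the `term -> {doc_id: positions}` index shape from the construction step; the small hand-built index is illustrative.

```python
import re

def tokenize(text):
    # Same normalization as at indexing time
    return re.findall(r"[a-z0-9]+", text.lower())

def candidates(index, query, mode="and"):
    # One postings set (doc ids) per query term; unknown terms yield empty sets
    postings = [set(index.get(t, {})) for t in tokenize(query)]
    if not postings or (mode == "and" and not all(postings)):
        return set()
    # AND-merge keeps docs containing every term; OR-merge, any term
    return set.intersection(*postings) if mode == "and" else set.union(*postings)

index = {"machine":  {"doc_1": [0], "doc_5": [2]},
         "learning": {"doc_1": [1], "doc_3": [0]}}
candidates(index, "machine learning")              # -> {"doc_1"}
candidates(index, "machine learning", mode="or")   # -> {"doc_1", "doc_3", "doc_5"}
```

Real engines merge sorted posting lists with galloping or skip-pointer intersection rather than Python sets, but the semantics are the same.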
Scoring:
- TF-IDF: score = Σ_term tf(term, doc) × idf(term)
- BM25: score = Σ_term idf(term) × (tf(term, doc) × (k1 + 1)) / (tf(term, doc) + k1 × (1 - b + b × (|doc| / avgdoclen)))
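The BM25 formula translates almost line-for-line into code. A sketch, assuming an index of shape `term -> {doc_id: term frequency}`; `k1 = 1.5` and `b = 0.75` are common defaults, and the smoothed idf variant used here is one of several in circulation.

```python
import math

def bm25_score(doc_id, query_terms, index, doc_lens, k1=1.5, b=0.75):
    # index: term -> {doc_id: tf}; doc_lens: doc_id -> document length
    n = len(doc_lens)
    avgdoclen = sum(doc_lens.values()) / n
    score = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        tf = postings.get(doc_id, 0)
        if tf == 0:
            continue  # term absent from this document contributes nothing
        df = len(postings)                                # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # smoothed BM25 idf
        # Saturating tf component with length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lens[doc_id] / avgdoclen))
    return score
```

`k1` controls how quickly repeated occurrences of a term saturate; `b` controls how strongly long documents are penalized (`b = 0` disables length normalization).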
Example
Query: "machine learning algorithms"
Index lookup:
"machine" → [doc_1, doc_5, doc_12, ...]
"learning" → [doc_1, doc_3, doc_7, ...]
"algorithms" → [doc_1, doc_2, doc_8, ...]
Candidate: doc_1 (matches all three terms)
BM25(doc_1) = idf("machine") × ... + idf("learning") × ... + idf("algorithms") × ...
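The whole example can be run end to end on a toy corpus (the three documents below are invented for illustration; `doc_1` is the one matching all three query terms).

```python
import math
import re
from collections import defaultdict

docs = {
    "doc_1": "machine learning algorithms for classification",
    "doc_2": "sorting algorithms and data structures",
    "doc_3": "deep learning with neural networks",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Build the inverted index (term -> {doc_id: tf}) and document lengths
index, doc_lens = defaultdict(dict), {}
for doc_id, text in docs.items():
    tokens = tokenize(text)
    doc_lens[doc_id] = len(tokens)
    for t in tokens:
        index[t][doc_id] = index[t].get(doc_id, 0) + 1

def bm25(doc_id, terms, k1=1.5, b=0.75):
    n, avgdl = len(docs), sum(doc_lens.values()) / len(docs)
    score = 0.0
    for term in terms:
        tf = index.get(term, {}).get(doc_id, 0)
        if tf == 0:
            continue
        df = len(index[term])
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lens[doc_id] / avgdl))
    return score

query = tokenize("machine learning algorithms")
ranking = sorted(docs, key=lambda d: bm25(d, query), reverse=True)
# doc_1 matches all three query terms and ranks first
```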
Variants and history
Inverted indexes are the foundation of modern IR, dating to the 1960s–70s. TF-IDF emerged as a standard scoring function in the 1980s. Okapi BM25 (Robertson et al., 1994) became the industry-standard baseline and has proven difficult to beat. Variants include BM25L (length-normalization tuning), BM25+ (length-penalty improvement), and field-weighted scoring (BM25F). Modern sparse methods include SPLADE (learned sparse embeddings), which combines the interpretability of term matching with neural term weighting.
When to use it
Choose sparse retrieval when:
- Exact keyword matching is important
- Query intent is unambiguous
- You need maximum speed and minimal resources
- Interpretability is critical (human-readable term matches)
- Working with structured or domain-specific language
- Combined with dense methods in hybrid search
Sparse methods struggle with paraphrases, synonyms, and semantic mismatch but excel at exact retrieval, scalability, and transparency.