Boolean Retrieval
What it is
Boolean retrieval is the simplest and oldest information retrieval model. A query is expressed as a Boolean expression over terms — "cat AND mat", "dog OR fox", "fox NOT dog" — and the system returns every document that satisfies the expression exactly. There is no scoring, no ranking, and no partial matching: a document either matches or it does not.
Boolean retrieval was the dominant model in library and database systems from the 1950s through the 1980s and remains the foundation of all modern search engine query evaluation, even when ranking is layered on top.
How it works
Each Boolean operator maps to a postings list operation:
AND — intersect postings lists. Only documents in both lists are returned.
"cat AND mat" → intersect(postings("cat"), postings("mat"))
OR — union postings lists. Documents in either list are returned.
"cat OR mat" → union(postings("cat"), postings("mat"))
NOT — complement. Documents in the first list but not the second.
"cat NOT mat" → postings("cat") \ postings("mat")
Because postings lists are sorted by document ID, AND and OR can be computed in O(n + m) time using a two-pointer merge — where n and m are the lengths of the two lists. NOT requires the complement relative to the full document set, which is expensive for common terms; engines typically push NOT inside AND (A AND NOT B) to prune first.
[illustrate: Venn diagram of two postings lists for “cat” [D1, D3, D5, D7] and “mat” [D3, D5, D8] — AND intersection shaded (D3, D5), OR union shaded (D1, D3, D5, D7, D8), NOT cat shaded as the complement]
Example
Corpus of 5 documents:
| Doc | Contains “cat” | Contains “mat” | Contains “sat” |
|---|---|---|---|
| D1 | ✓ | ✓ | |
| D2 | ✓ | ||
| D3 | ✓ | ✓ | |
| D4 | ✓ | ||
| D5 | ✓ | ✓ | ✓ |
| Query | Result |
|---|---|
cat AND mat |
D3, D5 |
cat OR mat |
D1, D2, D3, D5 |
cat AND NOT mat |
D1 |
(cat OR mat) AND sat |
D1, D5 |
Variants and history
Boolean retrieval originates in the Boolean algebra of George Boole (1854) and was adapted for library card indexes in the 1950s. Early online bibliographic systems — MEDLINE, DIALOG — used Boolean queries as their primary interface.
The model’s limitations drove the development of ranked retrieval:
- No ranking — all results are equivalent; useful when the set is small, frustrating when thousands of documents match.
- Exact term matching — no tolerance for synonyms, spelling variants, or relevance degrees.
- Difficult to formulate — users must think in formal logic rather than natural language.
Modern search engines use Boolean logic as a structural layer (must/should/must-not in Elasticsearch) but apply BM25 scoring within the matching set. The Boolean layer constrains which documents are candidates; the scorer ranks them.
When to use it
Use Boolean retrieval when:
- Precision is paramount over recall — legal document review, patent search, compliance filtering where every result must meet a hard criterion.
- Programmatic querying — applications generating structured queries know exactly which documents they need; ranking adds no value.
- Filter queries — in Elasticsearch/OpenSearch,
filterclauses use Boolean matching (cached and unscored) for efficient pre-filtering before ranked scoring.
Avoid pure Boolean retrieval for user-facing search: the lack of ranking and tolerance for approximate matches produces poor user experience on large corpora.