BM25

What it is

BM25 — short for Best Match 25 — is a bag-of-words ranking function used by search engines to score how relevant a document is to a query. It is the default scoring model in Elasticsearch, OpenSearch, Solr (via Lucene), and Tantivy. When you type a query and results come back ordered by relevance, BM25 is most likely doing the ranking.

The “25” reflects its place in a numbered lineage of probabilistic retrieval models (BM11 and BM15 among them) developed at the University of London during the 1970s–90s; the earlier iterations are rarely used today.

How it works

BM25 scores a document d against a query q by summing a per-term contribution for each query term that appears in the document:

score(d, q) = Σ IDF(tᵢ) · (tf(tᵢ, d) · (k₁ + 1)) / (tf(tᵢ, d) + k₁ · (1 - b + b · |d| / avgdl))

Three things are happening at once:

Inverse document frequency (IDF) measures how rare the term is across the entire corpus. A term that appears in every document (e.g. “the”) contributes almost nothing; a term that appears in one document (e.g. “eigenvalue”) contributes a lot.

Term frequency saturation controls how much credit a document earns for repeated occurrences of the same term. Raw TF-IDF rewards documents that repeat a term endlessly; BM25 caps this growth with the k₁ parameter (typical values: 1.2–2.0). Beyond a certain count, extra repetitions barely increase the score.

Length normalisation penalises long documents. A 10 000-word article that mentions your query term once should not outscore a 100-word paragraph that is densely about it. The b parameter (0–1, default 0.75) controls the strength of this penalty, and avgdl is the mean document length across the index.

[illustrate: two documents scoring against a three-term query — stacked bar segments per document showing IDF weight, TF saturation contribution, and length normalisation penalty for each term, with final scores labelled]
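The three components above can be combined into one compact scoring function. A minimal sketch in Python; the tokenised corpus layout and the Lucene-style IDF, ln((N − df + 0.5)/(df + 0.5) + 1), are illustrative assumptions, not part of the formula itself:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document against a query with BM25.

    corpus: list of token lists; doc: one token list drawn from it.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        # Lucene-style IDF: the +1 keeps the value non-negative
        # even for terms that appear in most documents.
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # k1 caps TF growth; b scales the length penalty by |d| / avgdl.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score
```

A term absent from the document contributes zero (tf is zero), so only matching query terms enter the sum, exactly as in the formula above.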

Example

Corpus of three documents:

ID   Text
D1   “the dog sat on the mat”
D2   “the cat sat on the mat”
D3   “the dog chased the cat”

Query: “dog sat”

  • “dog” appears in D1 and D3 → moderate IDF
  • “sat” appears in D1 and D2 → moderate IDF
  • D1 contains both terms → highest combined score
  • D3 contains “dog” but not “sat” → lower score
  • D2 contains “sat” but not “dog” → lower score, similar to D3

D1 ranks first. Length normalisation has no dramatic effect here because all documents are similar in length — but against a much longer D3 the penalty would widen the gap.

[illustrate: step-by-step BM25 calculation for D1 against query “dog sat” — show IDF values, TF saturation curve with k₁ annotated, and the final score as a sum of per-term contributions]
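The bullet-point reasoning above can be checked numerically end to end. A self-contained sketch; lowercase whitespace tokenisation and the Lucene-style IDF, ln((N − df + 0.5)/(df + 0.5) + 1), are assumptions, and the exact scores depend on the IDF variant and parameters chosen:

```python
import math
from collections import Counter

corpus = {
    "D1": "the dog sat on the mat".split(),
    "D2": "the cat sat on the mat".split(),
    "D3": "the dog chased the cat".split(),
}

def bm25(query, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

query = "dog sat".split()
scores = {name: bm25(query, doc, list(corpus.values()))
          for name, doc in corpus.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
# D1 matches both query terms and ranks first; D2 and D3 match one each.
```

Both “dog” and “sat” appear in two of the three documents, so their IDF weights are equal here; D1’s lead comes entirely from matching both terms.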

Variants and history

BM25 was formalised in the 1990s by Stephen Robertson and colleagues at City University London as part of the Okapi research project. The full name, Okapi BM25, honours the Okapi information-retrieval system in which it was first implemented.

Common variants:

  • BM25F — extends BM25 to multi-field documents (title, body, metadata), weighting each field separately before combining scores.
  • BM25+ — adds a small constant (δ) to the term-frequency component so that a matching term’s contribution has a positive lower bound even in very long documents, correcting BM25’s tendency to over-penalise long documents that do contain a query term.
  • BM25L — an alternative length normalisation designed to be less aggressive than the original b penalty for very long documents.
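Lv and Zhai’s BM25+ modification is small enough to state in the notation of the formula above: a constant δ (typically 1.0) is added to each matching term’s saturated-TF component, giving it a positive floor regardless of document length:

score(d, q) = Σ IDF(tᵢ) · [ (tf(tᵢ, d) · (k₁ + 1)) / (tf(tᵢ, d) + k₁ · (1 - b + b · |d| / avgdl)) + δ ]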

Modern neural retrieval models (dense retrievers, ColBERT) often outperform BM25 on semantic queries, but BM25 remains competitive on keyword queries, requires no training data, and is cheap to compute directly from the inverted index, making it the practical baseline in most production systems.

When to use it

Use BM25 as the default first-pass ranker in any keyword or full-text search system. It generally outperforms raw TF-IDF and is supported out of the box by all major search platforms.

Raise k₁ (towards 2.0) when repeated occurrences of a term remain informative, as in long-form documents where a term mentioned ten times really is more on-topic than one mentioned once. Lower b (towards 0.0) when document length is not a meaningful signal — for example, when indexing short social media posts of roughly equal length.
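Both effects can be checked directly on a single term’s contribution. A self-contained sketch (the corpus statistics and parameter values below are illustrative assumptions):

```python
import math

def bm25_term(tf_t, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Single-term BM25 contribution; handy for eyeballing parameter effects."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    norm = k1 * (1 - b + b * dl / avgdl)
    return idf * tf_t * (k1 + 1) / (tf_t + norm)

# One occurrence of a rare term, in a short doc vs. a 10x longer doc.
short = bm25_term(1, df=1, N=1000, dl=50, avgdl=100)
long_ = bm25_term(1, df=1, N=1000, dl=500, avgdl=100)
assert short > long_  # default b=0.75 penalises the long document

# With b = 0, length normalisation is off and the two scores coincide.
s0 = bm25_term(1, df=1, N=1000, dl=50, avgdl=100, b=0.0)
l0 = bm25_term(1, df=1, N=1000, dl=500, avgdl=100, b=0.0)
assert abs(s0 - l0) < 1e-12

# Raising k1 lets repeated occurrences keep earning credit for longer.
low_k1 = bm25_term(5, df=1, N=1000, dl=100, avgdl=100, k1=1.2)
high_k1 = bm25_term(5, df=1, N=1000, dl=100, avgdl=100, k1=2.0)
assert high_k1 > low_k1
```

Note that at tf = 1 the k₁ choice is irrelevant when the document has average length; k₁ only changes how much the second, third, and later occurrences are worth.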

Consider moving beyond BM25 when:

  • Queries are semantic or paraphrastic (“affordable accommodation” should match “cheap hotel”) — use dense retrieval or hybrid search instead.
  • Your corpus is multilingual and term overlap is low.
  • You need learning-to-rank (LTR) with behavioural signals; BM25 makes an excellent first-stage feature.

See also