BM25F

What it is

BM25F (BM25 extended to multiple weighted Fields) is a variant of BM25 for documents with several distinct fields, for example a web page with a title, a URL, headings, and body text. Standard BM25 treats a document as a single bag of words; BM25F applies length normalisation and a boost weight to each field separately, sums the results into a single pseudo-term-frequency, and applies TF saturation once to that combined value.

How it works

In standard BM25, each term’s TF contribution is computed over the whole document. BM25F instead computes a pseudo-TF for each term by summing weighted per-field TF values, then feeds this combined value into the standard BM25 saturation formula.

For a document with fields f₁, f₂, ..., fₙ:

pseudo_tf(t, d) = Σ_f w_f · tf(t, f) / (1 - b_f + b_f · (|f| / avgfl_f))

Where the sum runs over the fields f of document d, and:

  • tf(t, f): number of occurrences of term t in field f
  • w_f: boost weight for field f (e.g. title weight = 3, body weight = 1)
  • b_f: length normalisation parameter for field f
  • |f|: length of the field in tokens
  • avgfl_f: average length of field f across the corpus

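This pseudo-TF then replaces the raw TF in the BM25 saturation term. The saturated values are weighted by IDF and summed over the query terms:

score(q, d) = Σ_t idf(t) · pseudo_tf(t, d) / (k1 + pseudo_tf(t, d))

Because length normalisation has already been applied per field, the saturation term carries no further length factor. The result is a single relevance score per document.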
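The computation described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function names, the dict-based field representation, the default k1 = 1.2, and the particular smoothed IDF variant are all assumptions made for the example.

```python
import math

def pseudo_tf(term, fields, weights, b, avg_len):
    """Weighted sum of length-normalised per-field term frequencies.

    fields:  {field name: list of tokens} for one document
    weights: {field name: boost weight w_f}
    b:       {field name: length normalisation parameter b_f}
    avg_len: {field name: average field length across the corpus}
    """
    total = 0.0
    for name, tokens in fields.items():
        tf = tokens.count(term)
        if tf == 0:
            continue
        norm = 1.0 - b[name] + b[name] * (len(tokens) / avg_len[name])
        total += weights[name] * tf / norm
    return total

def idf(n_docs, doc_freq):
    # Assumed IDF variant (a common smoothed form); other definitions exist.
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25f_score(query_terms, fields, weights, b, avg_len,
                n_docs, doc_freqs, k1=1.2):
    """Sum over query terms of IDF times the saturated pseudo-TF."""
    score = 0.0
    for term in query_terms:
        ptf = pseudo_tf(term, fields, weights, b, avg_len)
        if ptf == 0.0:
            continue
        # Saturation is applied once, to the combined value.
        score += idf(n_docs, doc_freqs.get(term, 0)) * ptf / (k1 + ptf)
    return score
```

Here doc_freqs is a hypothetical mapping from term to the number of documents containing it in any field; how document frequency is defined across fields is a design choice the sketch leaves open.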

[illustrate: a document split into title field and body field — per-field TF and length normalisation computed separately with their own b_f parameters — then weighted and summed into a single pseudo-TF, which feeds into the standard BM25 formula alongside the IDF value]

Example

Document with two fields:

  • Title: "quick fox" (length 2)
  • Body: "the quick brown fox jumped over the lazy dog" (length 9)

Field weights: w_title = 3.0, w_body = 1.0. Length normalisation: b_title = b_body = 0.75. Average field lengths: 5 tokens for title, 20 tokens for body.

For query term "fox", which occurs once in each field:

  • normalised tf_title("fox") = 1 / (1 - 0.75 + 0.75 · (2/5)) = 1 / 0.55 ≈ 1.82
  • normalised tf_body("fox") = 1 / (1 - 0.75 + 0.75 · (9/20)) = 1 / 0.5875 ≈ 1.70
  • pseudo_tf = 3.0 · 1.82 + 1.0 · 1.70 ≈ 7.16

The high title weight means a title match contributes far more than a body match of equal raw frequency.
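The arithmetic can be checked with a few lines of Python. The numbers come from the example; b = 0.75 for both fields is the value used in its formulas, and the variable names are illustrative only. Intermediate values are kept at full precision, so the total prints as 7.16.

```python
b = 0.75  # same length normalisation parameter for both fields

fields = {
    # name: (raw tf of "fox", field length, average field length, weight)
    "title": (1, 2, 5.0, 3.0),
    "body": (1, 9, 20.0, 1.0),
}

contributions = {}
for name, (tf, length, avg_len, weight) in fields.items():
    norm = 1.0 - b + b * (length / avg_len)
    contributions[name] = weight * tf / norm

pseudo_tf = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name}: {value:.2f} ({value / pseudo_tf:.0%} of pseudo-TF)")
print(f"pseudo_tf = {pseudo_tf:.2f}")
```

The title accounts for roughly three quarters of the pseudo-TF even though the term occurs once in each field.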

Variants and history

BM25F was introduced by Robertson, Zaragoza, and Taylor in their 2004 paper “Simple BM25 Extension to Multiple Weighted Fields.” It was validated on the TREC Web track and became a widely used approach for multi-field web search ranking.

In practice, most production search engines implement a simpler approximation: boosting individual field scores and summing them at query time rather than combining at the TF level. This is less theoretically correct than true BM25F but easier to configure and often adequate.
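The difference between the two approaches shows up when one query term matches in several fields. The sketch below is a deliberately stripped-down comparison, assuming IDF = 1 for every term, no length normalisation, k1 = 1.2, and hypothetical function names; it is not how any particular engine computes its scores.

```python
K1 = 1.2  # assumed saturation parameter
WEIGHTS = {"title": 3.0, "body": 1.0}

def saturate(tf):
    return tf / (K1 + tf)

def per_field_sum(query, doc):
    """Approximation: saturate each field separately, then add boosted scores."""
    return sum(
        WEIGHTS[field] * saturate(tokens.count(term))
        for term in query
        for field, tokens in doc.items()
    )

def bm25f(query, doc):
    """BM25F: add weighted field TFs first, then saturate once per term."""
    return sum(
        saturate(sum(WEIGHTS[field] * tokens.count(term)
                     for field, tokens in doc.items()))
        for term in query
    )

query = ["quick", "fox"]
doc_a = {"title": ["fox"], "body": ["fox"]}    # one query term, matched twice
doc_b = {"title": ["fox"], "body": ["quick"]}  # both query terms matched

for name, doc in (("A", doc_a), ("B", doc_b)):
    print(name, round(per_field_sum(query, doc), 3), round(bm25f(query, doc), 3))
```

Under these assumptions the per-field sum gives both documents the same score, because the repeated "fox" match earns full credit in each field. BM25F saturates the combined "fox" frequency once, so the document that matches both query terms ranks higher.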

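Engine support varies:

  • Elasticsearch/OpenSearch support per-field boost parameters in multi_match queries. These score each field separately and then combine the results, so they approximate (but do not exactly implement) BM25F. Recent Elasticsearch versions also offer a combined_fields query, which its documentation describes as based on BM25F.
  • Lucene’s SimilarityBase can be extended to implement strict BM25F. Lucene also provides a CombinedFieldQuery that scores a term across several fields as if they were one, in the BM25F style.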

When to use it

Use BM25F (or its approximation) whenever you have multi-field documents in which a match in some fields is a stronger relevance signal than a match in others. Common scenarios:

  • Web search — title and URL matches are stronger signals than body matches.
  • E-commerce — product name matches should outweigh description matches.
  • Document search — heading and abstract matches outweigh footnote matches.

Tune b_f independently per field: set b_title = 0 (or low) since title length variation is not a quality signal; set b_body = 0.75 (default) to normalise long body text.
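The effect of b_f on a single field can be seen in a small sketch. The helper name and the field lengths (a 2-token and a 10-token title against an average of 5) are made up for illustration.

```python
def normalised_tf(tf, length, avg_len, b):
    """Length-normalised term frequency for one field."""
    return tf / (1.0 - b + b * (length / avg_len))

for b_title in (0.0, 0.75):
    short = normalised_tf(1, 2, 5.0, b_title)   # 2-token title
    long_ = normalised_tf(1, 10, 5.0, b_title)  # 10-token title
    print(f"b_title={b_title}: short={short:.2f}, long={long_:.2f}")
```

With b_title = 0 both titles yield 1.00, so title length has no influence. With b_title = 0.75 the short title yields about 1.82 and the long one about 0.57, a threefold gap for the same single match.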

See also