Term Frequency

What it is

Term frequency (TF) is the raw count of how many times a query term appears in a document. It is the most direct signal that a document is about a given topic: a document mentioning “protein” forty times is likely more relevant to a protein query than one mentioning it once.

TF is one of the two fundamental signals in classical term-weighting schemes. The other is inverse document frequency (IDF), which captures how distinctive a term is across the corpus.

How it works

In its raw form, TF is simply:

tf(t, d) = count of occurrences of term t in document d

Raw TF has a known problem: it grows without bound. A document repeating a query term a hundred times should not rank a hundred times higher than one repeating it once — at some point, extra repetitions add little information. Several normalised variants address this:

Logarithmic TF:

tf_log(t, d) = 1 + log₁₀(tf(t, d))   if tf > 0, else 0

Dampens large counts; a document with tf = 100 scores 1 + log₁₀(100) = 3, not 100.
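A minimal sketch of the logarithmic variant, assuming base-10 logs as in the example above:

```python
import math

def tf_log(count: int) -> float:
    """Logarithmic TF: dampens large raw counts (base-10 log)."""
    return 1.0 + math.log10(count) if count > 0 else 0.0

# tf = 1 contributes 1.0; tf = 100 contributes 3.0, not 100
```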

Boolean TF:

tf_bool(t, d) = 1 if tf > 0, else 0

Ignores frequency entirely; useful when presence matters more than count.

BM25 saturation — BM25 uses a more sophisticated saturation function, tf × (k₁ + 1) / (tf + k₁), whose steepness is controlled by the k₁ parameter (see the BM25 citation). This is the standard in modern search.
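The core of BM25's saturation (ignoring document length for the moment) can be sketched as follows; k₁ = 1.2 is a common default:

```python
def bm25_tf(tf: int, k1: float = 1.2) -> float:
    """BM25 term-frequency saturation (length normalisation omitted).

    Bounded above by k1 + 1, so extra repetitions of a term
    yield diminishing returns rather than linear growth.
    """
    return tf * (k1 + 1) / (tf + k1)

# the contribution approaches k1 + 1 = 2.2 as tf grows,
# so tf = 100 scores only slightly higher than tf = 10
```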

[illustrate: TF saturation curve — x-axis is raw term count (0 to 20), y-axis is TF contribution — three lines: raw TF (linear, steep), log TF (concave, flattening), BM25 saturation with k₁=1.2 (concave, flattening toward its bound of k₁ + 1 ≈ 2.2) — showing diminishing returns as count increases]

Example

Document D1: "The cat sat on the mat. The cat was fat."

After tokenisation and lowercasing (stop words retained for clarity):

Term   Raw TF
the    3
cat    2
sat    1
on     1
mat    1
was    1
fat    1

For a query "cat mat", D1’s TF values are cat=2, mat=1.
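The counts above can be reproduced with a rough tokeniser (lowercase, keep only alphabetic runs — a sketch, not a production analysis chain):

```python
import re
from collections import Counter

doc = "The cat sat on the mat. The cat was fat."

# Lowercase, then extract alphabetic token runs (strips punctuation)
tokens = re.findall(r"[a-z]+", doc.lower())

# Raw TF is just a count of occurrences per term
tf = Counter(tokens)

# tf["the"] == 3, tf["cat"] == 2, tf["mat"] == 1
```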

Variants and history

TF was formalised as part of the vector space model by Gerard Salton in the 1970s. The TF×IDF combination builds on Spärck Jones (1972), who introduced IDF, and Salton & Buckley (1988), who described the combined weighting scheme.

Augmented TF normalises by the maximum TF in the document to reduce the advantage of longer documents:

tf_aug(t, d) = 0.5 + 0.5 × (tf(t, d) / max_tf(d))
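A direct transcription of the formula, where max_tf is the highest raw count of any term in the document:

```python
def tf_aug(tf: int, max_tf: int) -> float:
    """Augmented TF: scales a term's count by the document's maximum TF,
    keeping the result in [0.5, 1.0] for any term that appears at all."""
    return 0.5 + 0.5 * (tf / max_tf)

# For D1 above (max_tf = 3, from "the"): "cat" -> 0.5 + 0.5 * (2/3) ≈ 0.83
```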

BM25 effectively replaces augmented TF with its saturation formula, plus an additional document-length penalty. This is why BM25 outperforms raw TF-IDF for most workloads.
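The full BM25 TF component combines saturation with that length penalty; this is a sketch using the common defaults k₁ = 1.2 and b = 0.75:

```python
def bm25_tf_component(tf: int, doc_len: int, avg_doc_len: float,
                      k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 TF with document-length normalisation.

    Documents longer than the corpus average have their TF contribution
    discounted; b controls how strong that discount is (0 disables it).
    """
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + length_norm)

# the same raw tf scores lower in a longer-than-average document
```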

When to use it

TF is rarely used on its own; in practice it is combined with IDF (in TF-IDF) or folded into BM25 scoring. The practical decisions are:

  • Use BM25 (the default in Elasticsearch, OpenSearch, Solr) — its TF saturation is superior to log or raw TF.
  • Disable TF with index_options: docs in Elasticsearch when field-level scoring is not needed (e.g. filter-only fields). This reduces index size and speeds up boolean queries.
  • Inspect TF via the explain API to debug unexpected rankings — unexpectedly high TF values in a short document often indicate an analysis chain that is tokenising too aggressively.
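As an illustration of the second point, a sketch of an Elasticsearch mapping that stores only document IDs in the postings for a filter-only text field (the index and field names here are hypothetical):

```json
PUT logs-index
{
  "mappings": {
    "properties": {
      "status_message": {
        "type": "text",
        "index_options": "docs"
      }
    }
  }
}
```

With "index_options": "docs", term frequencies and positions are not stored, so the field can still be matched in boolean queries but cannot contribute TF-based relevance scores.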

See also