Term Frequency
What it is
Term frequency (TF) is the raw count of how many times a query term appears in a document. It is the most direct signal that a document is about a given topic: a document mentioning “protein” forty times is likely more relevant to a protein query than one mentioning it once.
TF is one of the two fundamental signals in classical term-weighting schemes. The other is inverse document frequency (IDF), which captures how distinctive a term is across the corpus.
How it works
In its raw form, TF is simply:
tf(t, d) = count of occurrences of term t in document d
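As a minimal sketch in Python (function name and tokenisation are illustrative; real systems use an analysis chain, not whitespace splitting):

```python
from collections import Counter

def raw_tf(term: str, document: str) -> int:
    """Raw term frequency: count of `term` in a lowercased,
    punctuation-stripped, whitespace-split document."""
    tokens = document.lower().replace(".", " ").split()
    return Counter(tokens)[term]

raw_tf("cat", "The cat sat on the mat. The cat was fat.")  # 2
```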
Raw TF has a known problem: it grows without bound. A document repeating a query term a hundred times should not rank a hundred times higher than one repeating it once — at some point, extra repetitions add little information. Several normalised variants address this:
Logarithmic TF:
tf_log(t, d) = 1 + log₁₀(tf(t, d)) if tf > 0, else 0
Dampens large counts; with a base-10 logarithm, a document with tf=100 scores 1 + log₁₀(100) = 3, not 100.
Boolean TF:
tf_bool(t, d) = 1 if tf > 0, else 0
Ignores frequency entirely; useful when presence matters more than count.
BM25 saturation:
BM25 uses a more sophisticated saturation function controlled by the k₁ parameter (see the BM25 citation). This is the standard in modern search.
[illustrate: TF saturation curve — x-axis is raw term count (0 to 20), y-axis is TF contribution — three lines: raw TF (linear, steep), log TF (curves upward, flattening), BM25 saturation with k₁=1.2 (S-curve flattening at ~4) — showing diminishing returns as count increases]
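The variants above can be sketched as small functions. This is a sketch: the BM25 term here is only the saturation component, with length normalisation omitted.

```python
import math

def tf_log(tf: int) -> float:
    # Logarithmic damping: 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_bool(tf: int) -> int:
    # Presence-only signal: 1 if the term occurs at all
    return 1 if tf > 0 else 0

def tf_bm25(tf: int, k1: float = 1.2) -> float:
    # BM25 saturation (length normalisation omitted):
    # approaches k1 + 1 asymptotically as tf grows
    return tf * (k1 + 1) / (tf + k1)

tf_log(100)   # 3.0
tf_bm25(100)  # ≈ 2.17, already close to the k1 + 1 = 2.2 ceiling
```

Note how quickly BM25 saturates: the second occurrence of a term contributes far less than the first, and the hundredth contributes almost nothing.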
Example
Document D1: "The cat sat on the mat. The cat was fat."
After tokenisation and lowercasing (stop words retained for clarity):
| Term | Raw TF |
|---|---|
| the | 3 |
| cat | 2 |
| sat | 1 |
| on | 1 |
| mat | 1 |
| was | 1 |
| fat | 1 |
For a query "cat mat", D1’s TF values are cat=2, mat=1.
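The table above can be reproduced in a few lines (a sketch using naive lowercasing and whitespace tokenisation, stop words retained as in the example):

```python
from collections import Counter

doc = "The cat sat on the mat. The cat was fat."
tokens = doc.lower().replace(".", "").split()
tf = Counter(tokens)

tf["the"]  # 3
tf["cat"]  # 2
[tf[t] for t in "cat mat".split()]  # [2, 1]
```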
Variants and history
TF was formalised as part of the vector space model by Gerard Salton in the 1970s. The TF×IDF combination was described by Sparck Jones (1972) for IDF and Salton & Buckley (1988) for the combined scheme.
Augmented TF normalises by the maximum TF in the document to reduce the advantage of longer documents:
tf_aug(t, d) = 0.5 + 0.5 × (tf(t, d) / max_tf(d))
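A minimal sketch of augmented TF, reusing the example document (function name is illustrative):

```python
from collections import Counter

def tf_aug(term: str, counts: Counter) -> float:
    # Augmented TF: 0.5 baseline plus 0.5 scaled by the
    # document's maximum raw TF, so values lie in [0.5, 1.0]
    max_tf = max(counts.values())
    return 0.5 + 0.5 * counts[term] / max_tf

counts = Counter("the cat sat on the mat the cat was fat".split())
tf_aug("cat", counts)  # 0.5 + 0.5 * (2/3) ≈ 0.833
```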
BM25 effectively replaces augmented TF with its saturation formula, plus an additional document-length penalty. This is why BM25 outperforms raw TF-IDF for most workloads.
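The BM25 term-frequency component, combining saturation (k₁) and the document-length penalty (b), can be sketched as follows; parameter defaults are the commonly used k₁=1.2, b=0.75:

```python
def bm25_tf(tf: int, doc_len: int, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    # Saturating TF with document-length normalisation:
    # documents longer than average are penalised via b.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + norm)

bm25_tf(2, doc_len=10, avg_doc_len=20)  # shorter-than-average doc, score boosted
```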
When to use it
TF on its own is rarely used directly; in practice it is combined with IDF (in TF-IDF) or folded into BM25 scoring. The practical decisions are:
- Use BM25 (the default in Elasticsearch, OpenSearch, Solr) — its TF saturation is superior to log or raw TF.
- Disable TF with `index_options: docs` in Elasticsearch when field-level scoring is not needed (e.g. filter-only fields). This reduces index size and speeds up boolean queries.
- Inspect TF via the `explain` API to debug unexpected rankings; unexpectedly high TF values in a short document often indicate an analysis chain that is tokenising too aggressively.
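A minimal sketch of an Elasticsearch mapping that disables frequencies for a filter-only field (index and field names here are hypothetical):

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "index_options": "docs"
      }
    }
  }
}
```

With `docs`, only document IDs are stored in the postings list; term frequencies and positions are dropped, so the field can answer term and filter queries but cannot contribute a TF signal to scoring.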