# Inverse Document Frequency
## What it is
Inverse document frequency (IDF) quantifies how much information a term carries based on how rarely it appears across documents. The rarer the term, the higher its IDF, and the more weight it receives in relevance scoring.
IDF is the counterbalance to term frequency: TF rewards documents that repeat a query term; IDF rewards documents that contain distinctive query terms — terms that don’t appear everywhere.
## How it works
The classic IDF formula (Spärck Jones, 1972):
idf(t) = log(N / df(t))
Where:
- `N` = total number of documents in the corpus
- `df(t)` = number of documents containing term `t`
For a corpus of 1,000,000 documents:
"the"appears in 999,000 documents →idf = log(1,000,000 / 999,000) ≈ 0.001"search"appears in 50,000 documents →idf = log(1,000,000 / 50,000) ≈ 3.0"transducer"appears in 12 documents →idf = log(1,000,000 / 12) ≈ 11.4
BM25 uses a smoothed IDF variant: the 0.5 offsets avoid division by zero, and an added 1 inside the logarithm avoids negative values:
idf_bm25(t) = log(1 + (N - df(t) + 0.5) / (df(t) + 0.5))
Without that added 1, the raw probabilistic form log((N - df(t) + 0.5) / (df(t) + 0.5)) goes negative for any term appearing in more than half the documents; the variant above, which Lucene uses as its default BM25 IDF, keeps values non-negative.
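A short sketch contrasting the raw probabilistic IDF with the non-negative variant, reusing the 1M-document figures from earlier:

```python
import math

def idf_probabilistic(N: int, df: int) -> float:
    """Raw probabilistic IDF; negative once df exceeds N / 2."""
    return math.log((N - df + 0.5) / (df + 0.5))

def idf_bm25(N: int, df: int) -> float:
    """Lucene-style BM25 IDF; the added 1 keeps it non-negative."""
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

N = 1_000_000
for df in (999_000, 50_000, 12):
    print(f"df={df:>7}  raw={idf_probabilistic(N, df):7.3f}  "
          f"bm25={idf_bm25(N, df):6.3f}")
# df= 999000  raw= -6.906  bm25=  0.001   <- raw form goes negative
# df=  50000  raw=  2.944  bm25=  2.996
# df=     12  raw= 11.290  bm25= 11.290
```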
[illustrate: log curve showing IDF on y-axis vs document frequency (df) on x-axis for a corpus of 1M documents — annotated points for common terms (the, is, a) clustering near the x-axis, mid-frequency terms (search, index) in the middle, and rare terms (transducer, lemmatisation) at the top — showing the steep IDF benefit of rare terms]
## Example
Corpus N = 10:
| Term | DF | IDF (log₁₀) |
|---|---|---|
| the | 10 | log(10/10) = 0.0 |
| fox | 3 | log(10/3) ≈ 0.52 |
| jumped | 1 | log(10/1) = 1.0 |
A query "the fox jumped" would weight "jumped" most heavily and "the" not at all — which matches intuition: "jumped" is the most discriminating term in the query.
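The same arithmetic as a quick sketch, using base-10 logs to match the table:

```python
import math

N = 10
doc_freq = {"the": 10, "fox": 3, "jumped": 1}

# Per-term IDF with a base-10 logarithm, as in the table above.
weights = {t: math.log10(N / df) for t, df in doc_freq.items()}
print(weights)  # {'the': 0.0, 'fox': 0.522..., 'jumped': 1.0}

# Ranking the query terms by weight: "jumped" dominates, "the" is inert.
query = ["the", "fox", "jumped"]
print(sorted(query, key=weights.get, reverse=True))  # ['jumped', 'fox', 'the']
```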
## Variants and history
IDF was proposed by Karen Spärck Jones in a 1972 paper, “A statistical interpretation of term specificity and its application in retrieval.” It remains one of the most influential ideas in information retrieval.
Common variants:
- Probabilistic IDF (used in BM25): `log((N - df + 0.5) / (df + 0.5))`
- Smoothed IDF: `log(1 + N / df)`, which avoids a zero weight for terms appearing in all documents
- IDF with smoothing constant: `log(N / (df + 1))`, which adds 1 to DF to handle unseen terms at query time
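A side-by-side sketch of the three variants (natural log; the helper names are ours, not from any library):

```python
import math

def idf_probabilistic(N: int, df: int) -> float:
    """Probabilistic IDF (BM25); negative when df > N / 2."""
    return math.log((N - df + 0.5) / (df + 0.5))

def idf_smoothed(N: int, df: int) -> float:
    """Smoothed IDF; nonzero even for a term in every document."""
    return math.log(1 + N / df)

def idf_df_plus_one(N: int, df: int) -> float:
    """Adds 1 to df, so a query term with df = 0 is still defined."""
    return math.log(N / (df + 1))

N = 10
for df in (1, 3, 10):
    print(f"df={df:>2}  prob={idf_probabilistic(N, df):6.3f}  "
          f"smooth={idf_smoothed(N, df):5.3f}  df+1={idf_df_plus_one(N, df):6.3f}")
# At df = 10 (term in every document) the probabilistic form goes negative
# (about -3.04) and the df+1 form dips just below zero (about -0.10);
# only the smoothed form stays positive (about 0.69).
```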
In transformer-based sparse retrieval models such as SPLADE, IDF-like term weighting is learned by the model rather than computed from corpus statistics.
## When to use it
IDF is automatically computed and applied by all major search engines. Awareness of it matters for:
- Domain-specific corpora: if your corpus is entirely about a narrow topic, the terms central to that topic will have high DF and thus low IDF. You may need custom boosting or domain-specific stop lists to counteract this.
- Multi-lingual corpora: when IDF is computed globally over a mixed-language index, a common word in one language appears only in that language's documents, so its IDF is inflated relative to its true specificity. Per-language indices or analyzers avoid this.
- Debugging unexpected scores: the `explain` output in Elasticsearch shows the IDF value for each term, making it easy to see when a supposedly important term has a low weight because it appears in many documents (see the sketch after this list).
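A minimal sketch of retrieving that explanation programmatically, assuming the official `elasticsearch` Python client (8.x); the index name `articles`, document id `1`, and field `body` are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Ask Elasticsearch why document "1" in the (hypothetical) "articles"
# index scored what it did for a single-term query.
resp = es.explain(
    index="articles",
    id="1",
    query={"match": {"body": "transducer"}},
)

# The explanation tree breaks the BM25 score into factors; the IDF node's
# description reads "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))",
# where n is df(t) and N is the document count.
print(resp["explanation"])
```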