Document Frequency
What it is
Document frequency (DF) is the count of documents in a corpus that contain a given term at least once. It is distinct from term frequency (TF), which counts occurrences within a single document.
DF is the basis of inverse document frequency (IDF): terms that appear in nearly every document (high DF) carry little discriminating power, while terms that appear in only a few documents (low DF) are strong relevance signals.
How it works
DF is computed directly from the inverted index: it is simply the length of the postings list for a term.
df(t) = number of documents containing term t
= length of postings list for t
For a corpus of N documents:
df("the") ≈ N — "the" appears in almost every English document.
df("eigenvalue") = small number — a rare technical term.
DF is computed at index build time and updated incrementally as documents are added or removed. In Lucene-based engines, it is stored as a summary statistic alongside each term dictionary entry.
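As a minimal sketch (using a hypothetical toy corpus, not any particular engine's API), an inverted index maps each term to the set of documents containing it, and DF falls out as the postings-list length:

```python
from collections import defaultdict

# Toy corpus: doc id -> text (hypothetical example documents).
corpus = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "eigenvalue decomposition of a matrix",
}

# Build an inverted index: term -> set of doc ids that contain it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.split():
        index[term].add(doc_id)

# DF is simply the length of the postings list.
def df(term: str) -> int:
    return len(index[term])

print(df("the"))         # 2 (appears in 2 of the 3 documents)
print(df("eigenvalue"))  # 1
```

Because the postings set is updated as each document is indexed, DF stays current as documents are added, mirroring the incremental updates described above.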
Example
Corpus of 10,000 documents. Spot DF values:
| Term | DF | Rarity |
|---|---|---|
| the | 9,980 | extremely common |
| search | 4,200 | common |
| inverted | 310 | moderately rare |
| postings | 88 | rare |
| transducer | 12 | very rare |
The IDF of "transducer" will be much higher than that of "search", meaning it contributes more to the relevance score of documents that contain it.
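To make the comparison concrete, here is a sketch that computes IDF for the table's values using the classic log(N/df) form (real engines use smoothed variants, e.g. BM25's formulation):

```python
import math

N = 10_000  # corpus size from the example above
df = {
    "the": 9_980,
    "search": 4_200,
    "inverted": 310,
    "postings": 88,
    "transducer": 12,
}

def idf(term: str) -> float:
    # Classic IDF: log(N / df). Production scorers typically
    # add smoothing to avoid division issues and negative values.
    return math.log(N / df[term])

for term, d in df.items():
    print(f"{term:10s} df={d:5d}  idf={idf(term):.3f}")
```

Running this shows idf("transducer") ≈ 6.7 versus idf("search") ≈ 0.87, while idf("the") is nearly zero: exactly the discrimination behavior the table illustrates.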
Variants and history
DF-based term weighting was introduced by Karen Spärck Jones in her 1972 paper proposing inverse document frequency. She observed that terms appearing in fewer documents are better discriminators and should be weighted more heavily.
Related measures:
- Collection frequency (CF) — total number of term occurrences across all documents (sum of TF across the corpus), rather than the number of documents. Used in language model smoothing.
- Relative DF — df(t) / N, the fraction of documents containing the term; equivalent to the term’s probability of appearing in a randomly chosen document.
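The distinction between CF and DF can be shown in a few lines (a toy-corpus sketch; the corpus and names are illustrative):

```python
from collections import Counter

corpus = [
    "to be or not to be",
    "be yourself",
    "not now",
]

N = len(corpus)
df = Counter()  # document frequency: counts each term once per document
cf = Counter()  # collection frequency: counts every occurrence

for doc in corpus:
    terms = doc.split()
    cf.update(terms)
    df.update(set(terms))  # set() collapses repeats within one document

print(df["be"], cf["be"])  # 2 3 — "be" occurs 3 times across 2 documents
print(df["be"] / N)        # relative DF: fraction of documents containing "be"
```

Note how "be" has cf = 3 but df = 2: the double occurrence in the first document counts once for DF but twice for CF.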
When to use it
DF is an internal implementation detail in all standard search engines — you don’t set it manually. The cases where it becomes visible:
- IDF tuning — if your corpus is very small or domain-specific, IDF values can be misleading. Terms common in your domain (e.g. "patient" in a medical corpus) will have abnormally high DF and thus low IDF, potentially under-ranking the most relevant documents.
- Custom scorers — if writing a custom Lucene Similarity class, DF is available via `CollectionStatistics.docCount()` and `TermStatistics.docFreq()`.
- Stop word decisions — terms with DF above a threshold (say, 95% of documents) are good candidates for the stop list.
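The stop-word heuristic from the last point can be sketched as a simple threshold filter (the DF table and the 95% cutoff are illustrative values, not a standard):

```python
# Hypothetical corpus statistics: term -> document frequency.
N = 10_000
df_table = {
    "the": 9_980,
    "of": 9_700,
    "search": 4_200,
    "transducer": 12,
}

# Flag terms that appear in more than 95% of documents
# as stop-word candidates.
threshold = 0.95 * N
stop_candidates = [t for t, d in df_table.items() if d > threshold]

print(stop_candidates)  # ['the', 'of']
```

In practice the cutoff is corpus-dependent; a domain-specific collection may need a higher threshold (or a curated list) to avoid discarding terms like "patient" that are common but still meaningful.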