Document Frequency

What it is

Document frequency (DF) is the count of documents in a corpus that contain a given term at least once. It is distinct from term frequency (TF), which counts occurrences within a single document.

DF is the basis of inverse document frequency (IDF): terms that appear in nearly every document (high DF) carry little discriminating power, while terms that appear in only a few documents (low DF) are strong relevance signals.

How it works

DF is computed directly from the inverted index: it is simply the length of the postings list for a term.

df(t) = number of documents containing term t
      = length of postings list for t

For a corpus of N documents:

  • df("the") = N (approximately) — “the” appears in almost every English document.
  • df("eigenvalue") = small number — a rare technical term.
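The relationship between postings lists and DF can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine stores its index: the whitespace tokenizer, the three-document corpus, and the helper names build_inverted_index and df are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            # A set is used so repeated occurrences in one document count once.
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def df(index, term):
    """Document frequency: length of the term's postings list (0 if absent)."""
    return len(index.get(term, []))

# Toy corpus, invented for illustration.
docs = [
    "the index maps the term to documents",
    "the eigenvalue of the matrix",
    "search the index",
]
index = build_inverted_index(docs)
print(df(index, "the"))         # 3: present in every document
print(df(index, "index"))       # 2
print(df(index, "eigenvalue"))  # 1
```

Note that "the" occurs twice in the first two documents but still contributes only one posting per document, which is what separates DF from TF.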

DF is computed at index build time and updated as documents are added or removed. In Lucene-based engines, it is stored as a summary statistic alongside each term dictionary entry; deleted documents generally continue to count toward DF until their segments are merged.

Example

Consider a corpus of 10,000 documents with the following sample DF values:

Term          DF      Rarity
the           9,980   extremely common
search        4,200   common
inverted      310     moderately rare
postings      88      rare
transducer    12      very rare

The IDF of transducer will be much higher than that of search, meaning it contributes more to the relevance score of documents that contain it.
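To make the gap concrete, the sketch below computes IDF for the example terms. The article does not fix an IDF formula, so the textbook form idf(t) = ln(N / df(t)) is assumed here; real engines use smoothed variants that give different absolute values but the same ordering.

```python
import math

N = 10_000  # corpus size from the example
doc_freq = {
    "the": 9_980,
    "search": 4_200,
    "inverted": 310,
    "postings": 88,
    "transducer": 12,
}

def idf(term):
    # Assumed textbook form: idf(t) = ln(N / df(t)); not any engine's exact formula.
    return math.log(N / doc_freq[term])

for term, df in doc_freq.items():
    print(f"{term:<12} df={df:>5}  idf={idf(term):.2f}")
```

Under this assumed formula, transducer scores about 6.7 against about 0.87 for search, while the comes out close to zero.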

Variants and history

DF was introduced by Karen Spärck Jones in her 1972 paper proposing inverse document frequency. She observed that terms appearing in fewer documents are better discriminators and should be weighted more heavily.

Related measures:

  • Collection frequency (CF) — total number of term occurrences across all documents (sum of TF across the corpus), rather than the number of documents. Used in language model smoothing.
  • Relative DF — df(t) / N, the fraction of documents containing the term; equivalent to the term’s probability of appearing in a randomly chosen document.
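The three measures can be computed side by side, as in this minimal sketch. The four-document corpus and the whitespace tokenizer are invented for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "search the index search again",
    "the postings list",
    "the index",
    "a transducer",
]

doc_freq = Counter()         # DF: number of documents containing the term
collection_freq = Counter()  # CF: total occurrences across the corpus

for text in docs:
    tokens = text.lower().split()   # hypothetical whitespace tokenizer
    collection_freq.update(tokens)  # every occurrence counts
    doc_freq.update(set(tokens))    # each document counts once per term

n_docs = len(docs)
relative_df = {term: doc_freq[term] / n_docs for term in doc_freq}

print(doc_freq["search"], collection_freq["search"])  # 1 2
print(relative_df["the"])                             # 0.75
```

Here "search" occurs twice in a single document, so its CF is 2 while its DF stays at 1.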

When to use it

DF is an internal statistic in standard search engines; you don’t set it manually. It becomes visible in the following cases:

  • IDF tuning — if your corpus is very small or domain-specific, IDF values can be misleading. Terms common in your domain (e.g. "patient" in a medical corpus) will have abnormally high DF and thus low IDF, potentially under-ranking the most relevant documents.
  • Custom scorers — if writing a custom Lucene Similarity class, DF is available via TermStatistics.docFreq(), and the document count N needed to turn it into IDF via CollectionStatistics.docCount().
  • Stop word decisions — terms with DF above a threshold (say, 95% of documents) are good candidates for the stop list.
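The stop word heuristic reduces to a filter on relative DF. The sketch below is one possible reading of it, reusing the DF values from the example above; the function name stop_word_candidates and the strict "greater than" comparison are assumptions, and the 95% default is the illustrative threshold from the text rather than a recommended setting.

```python
def stop_word_candidates(doc_freq, n_docs, threshold=0.95):
    """Return terms whose relative DF (df / n_docs) exceeds the threshold."""
    return sorted(
        term for term, df in doc_freq.items() if df / n_docs > threshold
    )

# DF values from the 10,000-document example.
doc_freq = {
    "the": 9_980,
    "search": 4_200,
    "inverted": 310,
    "postings": 88,
    "transducer": 12,
}

print(stop_word_candidates(doc_freq, n_docs=10_000))  # ['the']
```

Lowering the threshold widens the list quickly, so candidates are worth reviewing by hand, especially in a domain-specific corpus where a high-DF term may still matter to users.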

See also