TF-IDF

What it is

TF-IDF is a weighting scheme that assigns each term in a document a score reflecting two competing intuitions: a term is important if it appears often in this document, but less informative if it appears everywhere across the corpus. Multiply those two factors together and you get a number that is high for terms that are distinctive to a document and low for terms that are generic filler.

The scheme is used as a relevance signal in full-text search, as a feature vector for text classification, and as a baseline for document similarity. It predates modern neural approaches by decades and remains a useful reference point and practical fallback.

How it works

The score for a term t in a document d drawn from a corpus of N documents is:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Term Frequency (TF) measures how often a term appears in a specific document. The raw count works, but the log-normalised form is standard because it dampens the effect of a term appearing 50 times versus 5 times — the 50th occurrence carries far less new information than the 5th:

TF(t, d) = log(1 + count(t, d))
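
A minimal sketch in Python (the helper name tf and the tokenised-list representation of a document are illustrative choices, not taken from any particular library):

    import math

    def tf(term: str, document: list[str]) -> float:
        """Log-normalised term frequency: log(1 + raw count of term in document)."""
        return math.log(1 + document.count(term))

    tf("dog", "the dog sat on the mat".split())   # log(2) ≈ 0.693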

Inverse Document Frequency (IDF) measures how rare the term is across all documents. df(t) is the number of documents containing at least one occurrence of t:

IDF(t) = log(N / df(t))

The log here compresses the range — without it, a term appearing in one document out of a million would receive a raw IDF of 1 000 000, dwarfing TF entirely. A smoothed variant adds 1 to the denominator to avoid division-by-zero for unseen terms:

IDF(t) = log(N / (1 + df(t)))
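
In the same sketch style (corpus here is assumed to be a list of tokenised documents):

    import math

    def idf(term: str, corpus: list[list[str]], smooth: bool = False) -> float:
        """log(N / df); the smoothed form log(N / (1 + df)) never divides by zero,
        even for a term that appears in no document."""
        n = len(corpus)
        df = sum(1 for doc in corpus if term in doc)
        return math.log(n / (1 + df)) if smooth else math.log(n / df)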

Putting it together:

TF-IDF(t, d) = log(1 + count(t, d)) × log(N / df(t))

Common words like “the” or “is” appear in nearly every document, so df(t) ≈ N and IDF ≈ 0 — they contribute almost nothing to the score. A rare, specific term like “eigenvector” might appear in only a handful of documents, giving it a high IDF and making any document containing it rank prominently for queries involving it.
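
Tying the two sketches together (this reuses the tf and idf helpers defined above):

    def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
        """High for terms frequent in this document but rare across the corpus;
        a term present in every document gets idf = log(N / N) = 0, zeroing the product."""
        return tf(term, document) * idf(term, corpus)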

[illustrate: two documents scoring against a two-term query — stacked bar segments per document showing TF contribution (log-normalised count) and IDF contribution (log of N/df) for each term, with the final TF-IDF product labelled per term and a total score at the top of each bar]

Example

Corpus of three documents (N = 3):

ID   Text
D1   “the dog sat on the mat”
D2   “the cat sat on the mat”
D3   “the dog chased the cat”

Query term: “dog”

First, compute IDF. “dog” appears in D1 and D3, so df("dog") = 2:

IDF("dog") = log(3 / 2) = log(1.5) ≈ 0.405

Now compute TF for each document (using natural log, log(1 + count)):

Doc   count(“dog”)   TF = log(1 + count)   TF-IDF
D1    1              log(2) ≈ 0.693        0.693 × 0.405 ≈ 0.281
D2    0              log(1) = 0            0.000 × 0.405 = 0.000
D3    1              log(2) ≈ 0.693        0.693 × 0.405 ≈ 0.281

D1 and D3 score equally; D2 scores zero because “dog” does not appear there. Now consider “sat” (df = 2, same IDF ≈ 0.405). Both D1 and D2 contain it once. A query for “dog sat” would sum the TF-IDF scores per term:

  • D1: 0.281 (dog) + 0.281 (sat) = 0.562
  • D2: 0.000 (dog) + 0.281 (sat) = 0.281
  • D3: 0.281 (dog) + 0.000 (sat) = 0.281

D1 ranks first because it is the only document containing both query terms.
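
The whole ranking can be reproduced in a few self-contained lines (a sketch of the calculation above, not a production scorer):

    import math

    corpus = {
        "D1": "the dog sat on the mat".split(),
        "D2": "the cat sat on the mat".split(),
        "D3": "the dog chased the cat".split(),
    }
    n = len(corpus)

    def score(term, doc):
        df = sum(1 for d in corpus.values() if term in d)         # documents containing the term
        return math.log(1 + doc.count(term)) * math.log(n / df)   # TF × IDF

    for doc_id, doc in corpus.items():
        total = sum(score(term, doc) for term in ("dog", "sat"))
        print(doc_id, round(total, 3))   # D1 0.562, D2 0.281, D3 0.281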

[illustrate: step-by-step TF-IDF calculation for the term “dog” across D1, D2, D3 — show count → TF transformation on a number line, then IDF as a horizontal bar representing log(N/df), then the product as a final bar per document]

Variants and history

TF-IDF emerged from work in information retrieval during the 1960s–70s. Karen Spärck Jones formalised the IDF component in a 1972 paper, arguing that term specificity should be accounted for in retrieval weights. The combined TF-IDF formulation became standard through the work of Salton and Buckley on the SMART retrieval system in the 1980s.

Several normalisation variants exist:

  • Sublinear TF scaling — log(1 + count), as shown above; the most common form.
  • Double normalisation (0.5) — 0.5 + 0.5 × (count / max_count_in_doc), which scales TF relative to the most frequent term in the document, preventing long documents from always winning on raw counts.
  • BM25 IDF — replaces log(N / df) with log((N - df + 0.5) / (df + 0.5)), which can go negative for very common terms and is part of what makes BM25 slightly more principled than raw TF-IDF in practice.
  • TF-IDF vectors — when applied to entire documents (not just per-query-term), each document becomes a vector in term-space, enabling cosine similarity as a document-to-document or query-to-document distance metric; see the sketch after this list.
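
For the vector use, a hedged sketch with scikit-learn (assuming it is installed; its defaults differ slightly from the formulas above: smoothed IDF, L2-normalised rows, and raw-count TF unless sublinear_tf=True):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the dog sat on the mat",
        "the cat sat on the mat",
        "the dog chased the cat",
    ]
    vectoriser = TfidfVectorizer(sublinear_tf=True)   # log-scaled TF, matching the form above
    doc_term = vectoriser.fit_transform(docs)         # sparse document-term matrix
    print(cosine_similarity(doc_term))                # 3 × 3 pairwise document similarity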

When to use it

TF-IDF is a reasonable choice when:

  • You need a fast, dependency-free relevance signal with no training data required.
  • You are building a document similarity or clustering pipeline and need a sparse vector representation.
  • You want an interpretable baseline before committing to heavier machinery.

Prefer BM25 over raw TF-IDF for search ranking in almost every case. BM25 adds term frequency saturation (so a term repeated 100 times in a document does not receive 100× the score of a single occurrence) and length normalisation (so long documents do not systematically outrank short ones). Both weaknesses exist in standard TF-IDF and have measurable effects on retrieval quality.
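
For contrast, a sketch of the BM25 per-term score with the commonly used free parameters k1 ≈ 1.5 and b ≈ 0.75 (these defaults and the helper name are assumptions, not taken from any specific engine):

    import math

    def bm25_term_score(term: str, doc: list[str], corpus: list[list[str]],
                        k1: float = 1.5, b: float = 0.75) -> float:
        """TF saturates as the count grows; longer-than-average documents are penalised."""
        n = len(corpus)
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5))      # BM25 IDF; negative for very common terms
        count = doc.count(term)
        avgdl = sum(len(d) for d in corpus) / n          # average document length in tokens
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        return idf * count * (k1 + 1) / (count + length_norm)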

Move beyond both for semantic or paraphrastic queries. A user searching for “affordable accommodation” will get no credit from a TF-IDF system for documents that only use the phrase “cheap hotel” — the vocabulary mismatch means the term overlap is zero. Dense retrieval models and hybrid search pipelines address this at the cost of added infrastructure complexity.

See also