TF-IDF
What it is
TF-IDF is a weighting scheme that assigns each term in a document a score reflecting two competing intuitions: a term is important if it appears often in this document, but less informative if it appears everywhere across the corpus. Multiply those two factors together and you get a number that is high for terms that are distinctive to a document and low for terms that are generic filler.
The scheme is used as a relevance signal in full-text search, as a feature vector for text classification, and as a baseline for document similarity. It predates modern neural approaches by decades and remains a useful reference point and practical fallback.
How it works
The score for a term t in a document d drawn from a corpus of N documents is:
TF-IDF(t, d) = TF(t, d) × IDF(t)
Term Frequency (TF) measures how often a term appears in a specific document. The raw count works, but the log-normalised form is standard because it dampens the effect of a term appearing 50 times versus 5 times — the 50th occurrence carries far less new information than the 5th:
TF(t, d) = log(1 + count(t, d))
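The dampening is easy to see numerically. A quick check (variable names are illustrative):

```python
import math

# Sublinear TF: a tenfold increase in raw count yields only a
# modest increase in weight.
tf_5 = math.log(1 + 5)    # ≈ 1.792
tf_50 = math.log(1 + 50)  # ≈ 3.932, about 2.2x rather than 10x
```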
Inverse Document Frequency (IDF) measures how rare the term is across all documents. df(t) is the number of documents containing at least one occurrence of t:
IDF(t) = log(N / df(t))
The log here compresses the range — without it, a term appearing in one document out of a million would receive a raw IDF of 1 000 000, dwarfing TF entirely. A smoothed variant adds 1 to the denominator to avoid division-by-zero for unseen terms:
IDF(t) = log(N / (1 + df(t)))
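The compression is dramatic at corpus scale. A quick check with an assumed million-document corpus:

```python
import math

N = 1_000_000  # assumed corpus size
df = 1         # term appearing in a single document

ratio = N / df                       # 1,000,000: would dwarf any TF value
idf = math.log(N / df)               # ≈ 13.8, comparable in scale to TF
idf_smooth = math.log(N / (1 + df))  # ≈ 13.1, and safe even if df were 0
```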
Putting it together:
TF-IDF(t, d) = log(1 + count(t, d)) × log(N / df(t))
Common words like “the” or “is” appear in nearly every document, so df(t) ≈ N and IDF ≈ 0 — they contribute almost nothing to the score. A rare, specific term like “eigenvector” might appear in only a handful of documents, giving it a high IDF and making any document containing it rank prominently for queries involving it.
[illustrate: two documents scoring against a two-term query — stacked bar segments per document showing TF contribution (log-normalised count) and IDF contribution (log of N/df) for each term, with the final TF-IDF product labelled per term and a total score at the top of each bar]
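The whole scheme fits in a few lines. A minimal sketch in Python (the function name and its token-list interface are illustrative, not from any particular library):

```python
import math
from collections import Counter

def tfidf(term, doc_tokens, corpus):
    """TF-IDF for one term in one document.

    doc_tokens: token list for the document being scored.
    corpus: list of token lists, one per document (N = len(corpus)).
    Uses log(1 + count) TF and the unsmoothed IDF log(N / df).
    """
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # term appears nowhere; guard instead of smoothing
    tf = math.log(1 + Counter(doc_tokens)[term])
    idf = math.log(len(corpus) / df)
    return tf * idf
```

Note that a term present in every document gets IDF = log(1) = 0, so the function returns zero for it, which is exactly the "common words contribute nothing" behaviour described above.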
Example
Corpus of three documents (N = 3):
| ID | Text |
|---|---|
| D1 | “the dog sat on the mat” |
| D2 | “the cat sat on the mat” |
| D3 | “the dog chased the cat” |
Query term: “dog”
First, compute IDF. “dog” appears in D1 and D3, so df("dog") = 2:
IDF("dog") = log(3 / 2) = log(1.5) ≈ 0.405
Now compute TF for each document (using natural log, log(1 + count)):
| Doc | count(“dog”) | TF = log(1 + count) | TF-IDF |
|---|---|---|---|
| D1 | 1 | log(2) ≈ 0.693 | 0.693 × 0.405 ≈ 0.281 |
| D2 | 0 | log(1) = 0 | 0.000 × 0.405 = 0.000 |
| D3 | 1 | log(2) ≈ 0.693 | 0.693 × 0.405 ≈ 0.281 |
D1 and D3 score equally; D2 scores zero because “dog” does not appear there. Now consider “sat” (df = 2, same IDF ≈ 0.405). Both D1 and D2 contain it once. A query for “dog sat” would sum the TF-IDF scores per term:
- D1: 0.281 (dog) + 0.281 (sat) = 0.562
- D2: 0.000 (dog) + 0.281 (sat) = 0.281
- D3: 0.281 (dog) + 0.000 (sat) = 0.281
D1 ranks first because it is the only document containing both query terms.
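The calculation above fits in a short script, which doubles as a check on the arithmetic:

```python
import math

docs = {
    "D1": "the dog sat on the mat".split(),
    "D2": "the cat sat on the mat".split(),
    "D3": "the dog chased the cat".split(),
}
N = len(docs)

def tfidf(term, tokens):
    df = sum(1 for d in docs.values() if term in d)
    if df == 0:
        return 0.0
    return math.log(1 + tokens.count(term)) * math.log(N / df)

# Query "dog sat": sum the per-term scores for each document.
scores = {doc_id: sum(tfidf(t, tokens) for t in ("dog", "sat"))
          for doc_id, tokens in docs.items()}
# → {'D1': 0.562..., 'D2': 0.281..., 'D3': 0.281...}
```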
[illustrate: step-by-step TF-IDF calculation for the term “dog” across D1, D2, D3 — show count → TF transformation on a number line, then IDF as a horizontal bar representing log(N/df), then the product as a final bar per document]
Variants and history
TF-IDF emerged from work in information retrieval during the 1960s–70s. Karen Spärck Jones formalised the IDF component in a 1972 paper, arguing that term specificity should be accounted for in retrieval weights. The combined TF-IDF formulation became standard following work by Salton and Buckley in the 1980s through the SMART retrieval system.
Several normalisation variants exist:
- Sublinear TF scaling — log(1 + count), as shown above; the most common form.
- Double normalisation (0.5) — 0.5 + 0.5 × (count / max_count_in_doc), which scales TF relative to the most frequent term in the document, preventing long documents from always winning on raw counts.
- BM25 IDF — replaces log(N / df) with log((N − df + 0.5) / (df + 0.5)), which can go negative for very common terms and is part of what makes BM25 slightly more principled than raw TF-IDF in practice.
- TF-IDF vectors — when applied to entire documents (not just per-query-term), each document becomes a vector in term-space, enabling cosine similarity as a document-to-document or query-to-document distance metric.
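The vector-space view takes only two helpers. A sketch with a sparse dict representation (function names are mine, not from any library):

```python
import math
from collections import Counter

def tfidf_vector(tokens, corpus):
    """Sparse {term: weight} vector for one document."""
    N = len(corpus)
    vec = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(N / df)
        if idf > 0:  # terms present in every document carry no signal
            vec[term] = math.log(1 + count) * idf
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

On the three-document example corpus earlier, "the" drops out of every vector (IDF = 0), and D1 and D2 share three of their four remaining terms, giving a cosine of 0.75.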
When to use it
TF-IDF is a reasonable choice when:
- You need a fast, dependency-free relevance signal with no training data required.
- You are building a document similarity or clustering pipeline and need a sparse vector representation.
- You want an interpretable baseline before committing to heavier machinery.
Prefer BM25 over raw TF-IDF for search ranking in almost every case. BM25 adds term frequency saturation (so a term repeated 100 times in a document does not receive 100× the score of a single occurrence) and length normalisation (so long documents do not systematically outrank short ones). Both weaknesses exist in standard TF-IDF and have measurable effects on retrieval quality.
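The saturation behaviour is visible in BM25's term-frequency component alone. A sketch using the standard k1 and b free parameters (the defaults shown are typical choices, not mandated values):

```python
def bm25_tf(count, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 TF component: saturates as count grows, and penalises
    documents longer than the corpus average."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return count * (k1 + 1) / (count + norm)

# With doc_len == avg_doc_len: count=1 → 1.0, count=10 → ~1.96,
# count=100 → ~2.17; the score is bounded by k1 + 1 = 2.2, so 100
# occurrences never earn anywhere near 100x the weight of one.
```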
Move beyond both for semantic or paraphrastic queries. A user searching for “affordable accommodation” will get no credit from a TF-IDF system for documents that only use the phrase “cheap hotel” — the vocabulary mismatch means the term overlap is zero. Dense retrieval models and hybrid search pipelines address this at the cost of added infrastructure complexity.