Shingling
What it is
A shingle is an n-gram drawn from a document and treated as a set element rather than a sequence position. Shingling is the process of extracting the complete set of shingles from a document so that two documents can be compared by how much their sets overlap.
The key shift from ordinary n-gram indexing is the move from sequence to set. A standard n-gram index cares about position — which n-gram appeared where. A shingle set discards position entirely and asks only: which n-grams are present at all? That reduction makes it cheap to estimate document-level similarity across millions of pairs without comparing them word by word.
Shingles operate at two granularities:
- Character shingles (k-shingling): the text is treated as a character sequence and every k-length character substring is extracted. "hello" with k = 3 yields {hel, ell, llo}.
- Word shingles (w-shingling): the text is tokenised first; every contiguous sequence of w words is one shingle. "the quick brown fox" with w = 2 yields {the quick, quick brown, brown fox}.
How it works
Shingle set construction. Given a document d and width k, slide a window of width k across every character (or token) in d and add each window’s content to a set. Duplicates are dropped — the output is a set, not a multiset.
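As a concrete sketch, here is a minimal Python version of both constructions. The function names are our own, and a real pipeline would normalise first (lowercasing, whitespace collapsing):

```python
def char_shingles(text: str, k: int) -> set[str]:
    """All k-character substrings of text, collected into a set
    (duplicate windows collapse automatically)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_shingles(text: str, w: int) -> set[str]:
    """All contiguous w-word windows, joined back into strings."""
    tokens = text.split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

print(char_shingles("hello", 3))               # {'hel', 'ell', 'llo'}
print(word_shingles("the quick brown fox", 2))
# {'the quick', 'quick brown', 'brown fox'} (set order is arbitrary)
```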
Jaccard similarity. Once two documents A and B are represented as shingle sets, their resemblance is the Jaccard coefficient:
J(A, B) = |A ∩ B| / |A ∪ B|
J = 1.0 means the documents are identical under this representation; J = 0.0 means they share no shingles at all. Values above roughly 0.5 are a reliable signal of near-duplication for most web-scale text.
[illustrate: two documents each shown as a row of overlapping shingle chips; a Venn diagram between them with the intersection (shared shingles) highlighted and the Jaccard formula labelled with the counts filled in]
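In code, the coefficient is a direct transcription of the formula over Python sets:

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard coefficient of two shingle sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty documents are identical
    return len(a & b) / len(a | b)
```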
The scale problem. Computing exact Jaccard for every pair of documents in a corpus of N documents requires O(N²) set comparisons. At web scale that is infeasible. MinHash provides an unbiased approximation: apply multiple hash functions to a shingle set and keep only each function’s minimum hash value. The fraction of signature positions on which two documents agree estimates their Jaccard similarity without materialising the full sets. A full treatment belongs in a dedicated MinHash entry, but it is the reason shingling is practical at scale.
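A minimal sketch of the estimator. Salting Python’s built-in hash with random seeds stands in for a family of independent hash functions here; that is stable within one process run, which suffices for a demo, but a production system would use fixed, portable hashes. All names are illustrative:

```python
import random

NUM_HASHES = 128
SEEDS = [random.getrandbits(64) for _ in range(NUM_HASHES)]

def minhash_signature(shingles: set[str]) -> list[int]:
    # One minimum per simulated hash function: hashing (seed, shingle)
    # pairs gives a different ordering of the shingle space per seed.
    # Assumes a non-empty shingle set.
    return [min(hash((seed, s)) for s in shingles) for seed in SEEDS]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # Each position agrees with probability J(A, B), so the fraction
    # of matching positions is an unbiased estimate of the true Jaccard.
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents are now compared in O(NUM_HASHES) time regardless of how many shingles each contains.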
Example
Two short sentences at the character level with k = 4:
A = "the cat sat"
B = "the cat sat on the mat"
Shingles of A (k = 4; ␣ marks a space):
{the␣, he␣c, e␣ca, ␣cat, cat␣, at␣s, t␣sa, ␣sat} — 8 shingles
Shingles of B (k = 4):
{the␣, he␣c, e␣ca, ␣cat, cat␣, at␣s, t␣sa, ␣sat, sat␣, at␣o, t␣on, ␣on␣, on␣t, n␣th, ␣the, he␣m, e␣ma, ␣mat} — 18 distinct shingles ("the␣" occurs twice but enters the set once)
|A ∩ B| = 8 (every shingle of A appears in B)
|A ∪ B| = 18 (A’s set is a subset of B’s, so the union is just B’s set)
J(A, B) = 8 / 18 ≈ 0.44
A is a prefix of B, so J reduces to |A| / |B|: B roughly doubles the text with material A does not contain, and a Jaccard around 0.44 follows.
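The arithmetic is easy to check mechanically, reusing the char_shingles helper sketched earlier:

```python
a = char_shingles("the cat sat", 4)             # 8 shingles
b = char_shingles("the cat sat on the mat", 4)  # 18 distinct shingles

print(len(a & b), len(a | b))             # 8 18
print(round(len(a & b) / len(a | b), 2))  # 0.44
```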
Variants and history
Shingling was introduced by Andrei Broder and colleagues at AltaVista in the late 1990s as part of the system for identifying near-duplicate web pages. The original paper — “On the resemblance and containment of documents” (Broder, 1997) — defined both the shingle set model and the MinHash estimator in one contribution.
k-shingling (character-level) is preferred for plagiarism detection and near-duplicate web page identification. Typical k values:
- k = 5: suitable for short texts (tweets, product titles) where documents are small and even a few shared shingles are informative.
- k = 9–10: the common choice for web-scale near-duplicate detection. Long shingles are unlikely to collide by chance, so a high Jaccard score is a strong duplication signal.
w-shingling (word-level) is used for passage similarity and document-level topic overlap. Because shingles are words rather than characters, the set is smaller and more interpretable, but also more sensitive to minor wording differences — a one-word substitution breaks every shingle that spans that position.
Weighted shingling assigns a frequency or importance weight to each shingle, moving toward a weighted Jaccard that can account for term salience. This is less common but appears in some document clustering pipelines.
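One common formulation of the weighted variant replaces intersection and union with per-shingle minima and maxima over the weights. A sketch using raw occurrence counts as weights (the Counter-based framing is our own illustration, not a fixed standard):

```python
from collections import Counter

def weighted_jaccard(wa: Counter, wb: Counter) -> float:
    """Generalised Jaccard: sum of per-shingle weight minima
    over sum of per-shingle weight maxima."""
    keys = wa.keys() | wb.keys()
    num = sum(min(wa[s], wb[s]) for s in keys)
    den = sum(max(wa[s], wb[s]) for s in keys)
    return num / den if den else 1.0
```

With all weights equal to 1 this reduces to the plain Jaccard coefficient over sets.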
When to use it
Good fits:
- Near-duplicate detection at scale. Shingling combined with MinHash and locality-sensitive hashing is the standard pipeline for deduplicating web crawls and training corpora. If you need to remove duplicate or near-duplicate documents before indexing or training, this is the right starting point.
- Plagiarism detection. Character-level shingling is robust to whitespace changes, punctuation substitutions, and minor rewordings that would defeat exact-match approaches.
- Dataset deduplication. Pre-training data for language models routinely uses MinHash over word shingles to remove near-duplicates before training, improving data quality and reducing memorisation.
Poor fits:
- Semantic similarity. Shingles capture surface-form overlap only. Two paraphrases — "cheap hotel" vs "affordable accommodation" — share no shingles at all and will score J = 0. Use dense vector embeddings for semantic comparison.
- Multilingual corpora. Character shingles from different scripts share almost nothing by construction. Cross-lingual near-duplicate detection requires language-aware normalisation or embedding-based methods.
- Very short texts. A text of n characters yields only n − k + 1 shingles (and none at all when n < k), so a tweet-length document with k = 9 offers very little to match on. Choose k relative to the shortest document in your corpus; a rough rule is k ≤ (minimum document length) / 3.