Shingle
What it is
A shingle is an n-gram — a contiguous sequence of n tokens or characters — extracted from a document and treated as a set element rather than a positional index entry.
The word itself carries meaning. In search and IR literature, “n-gram” implies a sequence with position: the third bigram of a document is distinct from the fifth. A “shingle” implies membership in a set: the document either contains this subsequence or it does not, and how many times it appears, or where, is discarded. That framing shift — from sequence to set — is what the terminology signals.
Why a different name?
N-gram and shingle describe the same extracted string. The distinction is in how it is used downstream:
| Term | Retains position? | Primary use |
|---|---|---|
| N-gram | Yes | Language modelling, phrase indexing, autocomplete |
| Shingle | No | Document similarity, near-duplicate detection |
When a paper or system says “shingle”, it is announcing that documents will be compared as unordered bags of substrings — not that the extraction procedure differs.
k-shingling and w-shingling
Two granularities are standard:
- k-shingling — the source text is treated as a raw character sequence; every contiguous run of k characters is one shingle. Robust to tokenisation differences and minor whitespace edits.
- w-shingling — the text is tokenised first; every contiguous run of w words is one shingle. More interpretable; sensitive to exact wording.
The letters k and w are conventional in the literature and map directly to n in n-gram notation.
The MinHash connection
Representing a document as its shingle set makes Jaccard similarity the natural measure of document overlap. Computing exact Jaccard across millions of document pairs is infeasible — O(N²) comparisons. MinHash provides an unbiased approximation by hashing each shingle set to a compact signature. That pipeline — extract shingles, hash to a MinHash signature, compare signatures — is the standard approach for near-duplicate detection in web crawls and LLM pre-training data deduplication.
The shingle is the atomic unit that pipeline depends on. Choosing k or w determines what counts as a shared feature between documents; everything downstream in the similarity pipeline follows from that choice.
See also
- Shingling — the full algorithm: set construction, Jaccard similarity, MinHash approximation, and practical guidance on choosing k
- N-Gram
- Character N-Gram
- Trigram