Near-Duplicate Detection
-
SimHash
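SimHash is a fingerprinting algorithm that approximately preserves cosine similarity: documents with similar feature vectors receive fingerprints that differ in only a few bits (a small Hamming distance). Comparing short fixed-width fingerprints instead of full documents is what makes near-duplicate detection efficient.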
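A minimal sketch of the idea in Python, under stated assumptions: tokens are unweighted, each token is hashed with the first 8 bytes of MD5 (an arbitrary choice for illustration, not part of the definition), and the fingerprint is 64 bits wide. The names simhash and hamming are hypothetical.

```python
import hashlib

BITS = 64  # fingerprint width; assumed here, other widths are possible


def simhash(tokens):
    """Return a 64-bit SimHash fingerprint for an iterable of string tokens."""
    counts = [0] * BITS
    # Deduplicate so every distinct token gets weight 1 (an assumption;
    # real systems often weight tokens by frequency or TF-IDF).
    for tok in set(tokens):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # keep the bit only where the positive votes win
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when hamming(simhash(a), simhash(b)) falls below a chosen threshold; the threshold depends on the corpus and is not fixed by the algorithm.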
-
Shingling
Shingling represents a document as its set of overlapping n-grams (shingles), enabling near-duplicate detection via Jaccard similarity or MinHash approximations.
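The definition above can be sketched as follows. This is an illustrative sketch, not a reference implementation: it assumes word-level shingles of length 3, lowercasing and whitespace tokenization, and a MinHash built from salted BLAKE2b hashes. The function names and parameters are hypothetical.

```python
import hashlib


def shingles(text, n=3):
    """Set of overlapping word n-grams (assumed: lowercase, whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=64):
    """One minimum hash value per salted hash function."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(" ".join(s).encode("utf-8"),
                                digest_size=8, salt=salt).digest(),
                "big")
            for s in shingle_set))
    return signature


def minhash_estimate(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

For example, with n=2 the texts "a b c d" and "a b c e" share two of four distinct shingles, so jaccard returns 0.5. The MinHash estimate approaches that value as num_hashes grows, while only the fixed-size signatures need to be stored.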
-
Shingle
A shingle is an n-gram treated as a set element for document comparison. The term signals a shift from positional sequence analysis to set-based similarity measurement.
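To illustrate the shift from sequence to set, the small sketch below builds the positional bigram list of a phrase and then its shingle set. The helper name ngrams and the sample phrase are chosen only for illustration.

```python
def ngrams(tokens, n):
    """Ordered list of n-grams, one per starting position."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()

sequence = ngrams(tokens, 2)   # positional view: order and repeats are kept
shingle_set = set(sequence)    # set view: order and multiplicity are dropped

print(len(sequence))     # 5 bigram positions
print(len(shingle_set))  # 4 distinct shingles; ("to", "be") occurs twice
```

The repeated bigram ("to", "be") counts once in the set, which is why set-based measures such as Jaccard similarity ignore where and how often a shingle occurs.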