Fuzzy-Matching
-
Beider-Morse Phonetic Matching
Beider-Morse Phonetic Matching (BMPM) is a rule-based phonetic algorithm designed for Jewish surnames, applying language-specific phonological rules to match names across Yiddish, Hebrew, Russian, Polish, German, and other languages.
-
Cologne Phonetics
Cologne Phonetics (Kölner Phonetik) is a German phonetic algorithm that maps names to numeric codes, enabling phonetic matching across German spelling variations that Soundex, which was designed for English, handles poorly.
-
Daitch-Mokotoff Soundex
Daitch-Mokotoff Soundex is a Soundex variant developed for Slavic and Yiddish surnames. It produces a six-digit numeric code and can return multiple codes per name when a letter sequence has more than one plausible pronunciation.
-
Match Rating Approach
The Match Rating Approach encodes a name into a codex and then compares two codices using a defined similarity rating, returning a boolean match decision rather than leaving comparison to the caller.
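
The encode-then-compare flow can be sketched as follows. This is a minimal illustration based on the commonly published rule list, not a reference implementation: the function names `mra_codex` and `mra_match` are hypothetical, and details such as the order in which vowels and doubled consonants are removed vary between implementations.

```python
from itertools import zip_longest

VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def mra_codex(name: str) -> str:
    """Encode a name into its codex (hypothetical helper name)."""
    letters = [c for c in name.upper() if c.isalpha()]
    # Rule 1: drop vowels unless the vowel begins the word.
    kept = [c for i, c in enumerate(letters) if i == 0 or c not in VOWELS]
    # Rule 2: collapse doubled consonants to a single letter.
    deduped = []
    for c in kept:
        if not deduped or deduped[-1] != c:
            deduped.append(c)
    codex = "".join(deduped)
    # Rule 3: keep at most six letters (first three plus last three).
    if len(codex) > 6:
        codex = codex[:3] + codex[-3:]
    return codex


def _strip_common(a: str, b: str):
    """Remove characters that are identical at the same position."""
    keep_a, keep_b = [], []
    for ca, cb in zip_longest(a, b):
        if ca == cb:
            continue
        if ca is not None:
            keep_a.append(ca)
        if cb is not None:
            keep_b.append(cb)
    return "".join(keep_a), "".join(keep_b)


def mra_match(name1: str, name2: str) -> bool:
    """Return the boolean match decision for two names."""
    a, b = mra_codex(name1), mra_codex(name2)
    # Codices whose lengths differ by three or more are not compared.
    if abs(len(a) - len(b)) >= 3:
        return False
    total = len(a) + len(b)
    if total <= 4:
        minimum = 5
    elif total <= 7:
        minimum = 4
    elif total <= 11:
        minimum = 3
    else:
        minimum = 2
    a, b = _strip_common(a, b)              # left-to-right pass
    a, b = _strip_common(a[::-1], b[::-1])  # right-to-left pass on the rest
    similarity = 6 - max(len(a), len(b))
    return similarity >= minimum


print(mra_codex("Byrne"), mra_codex("Boern"))  # BYRN BRN
print(mra_match("Byrne", "Boern"))             # True
print(mra_match("Smith", "Jones"))             # False
```

The minimum rating falls as the codices get longer, so longer names tolerate more unmatched characters than short ones.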
-
NYSIIS
NYSIIS (New York State Identification and Intelligence System) is a phonetic encoding algorithm developed in 1970 that maps names to letter-based codes and is generally reported to match North American names more accurately than Soundex.
-
Caverphone
Caverphone is a phonetic encoding algorithm designed for New Zealand English names. Its revised version, Caverphone 2.0, produces a fixed 10-character code to match name variants across historical records.
-
Double Metaphone
Double Metaphone extends the original Metaphone algorithm by producing two phonetic codes per word — a primary and a secondary — to handle pronunciation ambiguity and non-English name patterns.
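
One way a caller might use the two codes is to compare every primary/secondary combination and grade the result. The sketch below does not implement the encoder itself: the code pairs are hard-coded, commonly cited example values, and the function name and the tier labels ("strong", "medium", "weak") are illustrative assumptions.

```python
from typing import Optional, Tuple

Codes = Tuple[str, str]  # (primary, secondary); secondary may be empty


def double_metaphone_match(a: Codes, b: Codes) -> Optional[str]:
    """Grade a match between two (primary, secondary) code pairs."""
    primary_a, secondary_a = a
    primary_b, secondary_b = b
    if primary_a == primary_b:
        return "strong"
    if (secondary_b and primary_a == secondary_b) or (
        secondary_a and secondary_a == primary_b
    ):
        return "medium"
    if secondary_a and secondary_a == secondary_b:
        return "weak"
    return None


# Assumed encoder output, hard-coded for illustration only.
smith = ("SM0", "XMT")
schmidt = ("XMT", "SMT")

print(double_metaphone_match(smith, schmidt))  # medium
```

Here the primary codes differ, but the secondary code of one name equals the primary code of the other, which is the kind of match a single-code algorithm would miss.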
-
Metaphone
Metaphone encodes an English word into a variable-length string of consonant sounds, applying context-sensitive phonological rules that allow names with different spellings but similar pronunciations to match.
-
Metaphone 3
Metaphone 3 is a commercial phonetic algorithm by Lawrence Philips that extends Double Metaphone with a substantially larger rule set; its author claims around 98% accuracy on English and European names.
-
Phonetic Encoding
Phonetic encoding maps a word to a compact code that represents its pronunciation, so that words which sound alike but are spelled differently produce the same code and match one another.
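
The matching step usually amounts to grouping words by their code and looking candidates up by the code of the query. The sketch below shows that pattern with a deliberately crude stand-in encoder: `toy_code`, `build_index`, and `lookup` are hypothetical names, and `toy_code` is not any of the published algorithms.

```python
from collections import defaultdict

VOWELS = set("AEIOUY")  # assumption for this toy encoder: Y counts as a vowel


def toy_code(word: str) -> str:
    """Toy encoder: keep the first letter, drop later vowels, collapse repeats."""
    letters = [c for c in word.upper() if c.isalpha()]
    code = []
    for i, c in enumerate(letters):
        if i > 0 and c in VOWELS:
            continue
        if code and code[-1] == c:
            continue
        code.append(c)
    return "".join(code)


def build_index(words, encode=toy_code):
    """Group words under their phonetic code."""
    index = defaultdict(list)
    for word in words:
        index[encode(word)].append(word)
    return index


def lookup(index, query, encode=toy_code):
    """Return all indexed words that share the query's code."""
    return index.get(encode(query), [])


index = build_index(["Smith", "Smyth", "Schmidt", "Jones"])
print(lookup(index, "Smithe"))  # ['Smith', 'Smyth']
```

Any real encoder with the same `str -> str` shape could be passed as `encode` in place of the toy function.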
-
Soundex
Soundex maps a name to a four-character code — one letter plus three digits — so that names with similar pronunciations but different spellings produce the same code and match one another.
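
A minimal sketch of the encoding, assuming the American Soundex rules (H and W do not separate equal codes, vowels do); the function name `soundex` and the handling of non-ASCII letters, which are simply dropped here, are choices made for this illustration.

```python
# Consonant groups and their digits; vowels, H, W, and Y have no digit.
CODES = {}
for group, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for letter in group:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character code: first letter plus three digits."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return (first + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261
```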
-
KStem
KStem is a conservative English stemmer that combines suffix-stripping with a built-in lexicon to avoid false conflations, producing cleaner stems than Porter2 at the cost of a dictionary dependency.
-
Trie
A trie is a tree where each path from root to node spells out a prefix, enabling O(k) term lookup, prefix enumeration, and autocomplete — where k is the length of the query string.
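
A minimal dictionary-based sketch of the structure; the class and method names (`Trie`, `insert`, `contains`, `complete`) are chosen for this illustration. Reaching the node for a string of length k takes k child lookups, which is where the O(k) bound comes from.

```python
class TrieNode:
    __slots__ = ("children", "is_term")

    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_term = False  # True if a stored term ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term: str) -> None:
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def _walk(self, prefix: str):
        """Follow prefix from the root; return the node reached or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, term: str) -> bool:
        node = self._walk(term)
        return node is not None and node.is_term

    def complete(self, prefix: str) -> list:
        """Enumerate all stored terms that start with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        results = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_term:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for word in ["smith", "smithson", "smyth", "schmidt"]:
    trie.insert(word)

print(trie.contains("smit"))  # False: a prefix, but not a stored term
print(trie.complete("smi"))   # ['smith', 'smithson']
```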
-
Character N-Gram
A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
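
As a sketch of the fuzzy-search use, the example below extracts character n-grams with a sliding window and scores two strings by the Jaccard overlap of their n-gram sets. The function names are hypothetical, no padding is applied, and Jaccard is only one of several possible similarity measures.

```python
def char_ngrams(text: str, n: int = 3) -> list:
    """Return every contiguous run of n characters, lowercased."""
    if n < 1:
        raise ValueError("n must be at least 1")
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets (0.0 if both are empty)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


print(char_ngrams("smith", 3))  # ['smi', 'mit', 'ith']
print(ngram_similarity("smith", "smyth", n=2))
```

With n=2, "smith" and "smyth" share the bigrams "sm" and "th" out of six distinct bigrams, so a single-letter spelling change still leaves a non-zero score.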
-
Trigram
A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
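
The sparsity trade-off is easy to see on a toy corpus. The sketch below counts token bigrams and trigrams over the same made-up sentence; the helper name `token_ngrams` and the corpus are illustrative assumptions.

```python
from collections import Counter


def token_ngrams(tokens, n):
    """Return every window of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat the cat ran".split()

bigrams = Counter(token_ngrams(tokens, 2))
trigrams = Counter(token_ngrams(tokens, 3))

print(bigrams[("the", "cat")])  # 2: this pair repeats in the corpus
print(max(trigrams.values()))   # 1: every trigram here occurs only once
```

Each trigram carries more context than a bigram, but on the same amount of text there are fewer repeated observations to estimate from.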
-
Bigram
A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
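
A minimal sketch of such a model, using unsmoothed maximum-likelihood estimates: the probability of a token given its predecessor is the count of the pair divided by the count of all pairs that start with the predecessor. The function names and the toy corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for each token, which tokens follow it and how often."""
    following = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        following[prev][cur] += 1
    return following


def bigram_prob(model, prev, cur):
    """Estimate P(cur | prev); returns 0.0 for an unseen context."""
    counts = model.get(prev)
    if not counts:
        return 0.0
    return counts[cur] / sum(counts.values())


tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)

print(bigram_prob(model, "the", "cat"))  # 2 of the 3 tokens after "the"
print(bigram_prob(model, "cat", "sat"))  # 0.5
```

A unigram model would assign "cat" the same probability everywhere; the bigram model raises it after "the" because it keeps the local order.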