Fuzzy-Matching

Beider-Morse Phonetic Matching

Beider-Morse Phonetic Matching (BMPM) is a rule-based phonetic algorithm designed for Jewish surnames, applying language-specific phonological rules to match names across Yiddish, Hebrew, Russian, Polish, German, and other languages.

Cologne Phonetics

Cologne Phonetics (Kölner Phonetik) is a German phonetic algorithm that maps names to numeric codes, enabling phonetic matching across German spelling variations that Soundex, which was designed around English pronunciation, handles poorly.
Daitch-Mokotoff Soundex

Daitch-Mokotoff Soundex is a Soundex variant developed for Slavic and Yiddish surnames. It produces a six-digit numeric code and can return multiple codes per name when a letter combination has more than one plausible pronunciation.
Match Rating Approach

The Match Rating Approach encodes a name into a codex and then compares two codices using a defined similarity rating, returning a boolean match decision rather than leaving comparison to the caller.
NYSIIS

NYSIIS (New York State Identification and Intelligence System) is a phonetic encoding algorithm developed in 1970 that maps names to letter-based codes, producing more accurate matches for North American names than Soundex.
Caverphone

Caverphone is a phonetic encoding algorithm designed for New Zealand English names, built to match name variants across historical records. Its 2.0 revision produces a 10-character code; the original version produced a 6-character code.
Double Metaphone

Double Metaphone extends the original Metaphone algorithm by producing two phonetic codes per word — a primary and a secondary — to handle pronunciation ambiguity and non-English name patterns.
Metaphone

Metaphone encodes an English word into a variable-length string of consonant sounds, applying context-sensitive phonological rules that allow names with different spellings but similar pronunciations to match.
Metaphone 3

Metaphone 3 is a commercial phonetic algorithm by Lawrence Philips that extends Double Metaphone with a substantially larger rule set, claiming around 98% accuracy on English and European names.
Phonetic Encoding

Phonetic encoding maps a word to a compact code that represents its pronunciation, so that words which sound alike but are spelled differently produce the same code and match one another.
Soundex

Soundex maps a name to a four-character code — one letter plus three digits — so that names with similar pronunciations but different spellings produce the same code and match one another.
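
The four-character encoding described above can be sketched in a few lines. This is a minimal illustration of the commonly documented American Soundex rules, not a reference implementation: the function name `soundex` is chosen here for illustration, and the sketch assumes ASCII letters only (any other alphabetic character is treated like a vowel).

```python
def soundex(name: str) -> str:
    """Return a four-character Soundex code (American Soundex rules)."""
    # Consonant groups that share a digit; vowels, Y, H and W have no digit.
    codes = {}
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    ):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = [letters[0]]                # the first letter is kept as-is
    prev = codes.get(letters[0], "")     # its digit still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code                      # a vowel resets prev, allowing repeats
        if len(result) == 4:
            break
    return "".join(result).ljust(4, "0")  # pad short codes with zeros


print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Both spellings produce the same code, which is what lets a lookup keyed on the Soundex code treat them as a match.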
KStem

KStem is a conservative English stemmer that combines suffix-stripping with a built-in lexicon to avoid false conflations, producing cleaner stems than Porter2 at the cost of a dictionary dependency.
Trie

A trie is a tree where each path from root to node spells out a prefix, enabling O(k) term lookup, prefix enumeration, and autocomplete — where k is the length of the query string.
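
A small sketch may make the O(k) lookup and prefix enumeration concrete. The class and method names below (`Trie`, `insert`, `contains`, `complete`) are illustrative choices, and children are stored in a plain dictionary per node, which is one of several possible layouts.

```python
class Trie:
    """Each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_term = False   # True if the path to this node is a stored term

    def insert(self, term: str) -> None:
        node = self
        for ch in term:
            node = node.children.setdefault(ch, Trie())
        node.is_term = True

    def _find(self, prefix: str):
        # Walk one node per character: O(k) for a prefix of length k.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, term: str) -> bool:
        node = self._find(term)
        return node is not None and node.is_term

    def complete(self, prefix: str) -> list:
        """Return every stored term that starts with prefix, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        results, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_term:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for word in ["car", "card", "care", "cat", "dog"]:
    trie.insert(word)
print(trie.complete("car"))   # ['car', 'card', 'care']
```

Reaching the prefix node costs O(k); the enumeration that follows is proportional to the size of the subtree under that node, so autocomplete latency depends on how many terms share the prefix.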
Character N-Gram

A character n-gram is a contiguous sequence of n characters extracted from a string, enabling tokenisation-free indexing, fuzzy search, language identification, and subword modelling.
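
As a rough illustration of the fuzzy-search use, the sketch below extracts character n-grams with a sliding window and compares two strings by the Jaccard overlap of their n-gram sets. The helper names `char_ngrams` and `ngram_similarity` are hypothetical, and Jaccard is only one of several overlap measures that could be used here.

```python
def char_ngrams(text: str, n: int = 3) -> list:
    """Return every contiguous n-character slice of text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of the two strings' n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(char_ngrams("search", 3))   # ['sea', 'ear', 'arc', 'rch']
```

Because the window slides one character at a time, no word boundaries are needed, which is why the approach works without a tokeniser.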
Trigram

A trigram is an n-gram of length 3 — three consecutive tokens considered as a unit. Trigrams extend bigrams with one extra token of context, improving disambiguation at the cost of sparser counts.
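
The sparsity trade-off can be seen even on a toy token sequence. The following sketch, with an illustrative helper named `token_ngrams` and a made-up example sentence, counts bigrams and trigrams over the same tokens.

```python
from collections import Counter


def token_ngrams(tokens: list, n: int) -> list:
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(token_ngrams(tokens, 2))
trigram_counts = Counter(token_ngrams(tokens, 3))

print(bigram_counts[("the", "cat")])   # 2
print(max(trigram_counts.values()))    # 1
```

The bigram ("the", "cat") is seen twice, but every trigram in this sample occurs only once: the extra token of context splits the same evidence across more distinct units.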
Bigram

A bigram is an n-gram of length 2 — two consecutive tokens considered as a pair. Bigram models condition each token on the one before it, capturing local order that unigram models discard.
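
Conditioning each token on its predecessor can be sketched as a table of follower counts. This is a minimal maximum-likelihood estimate with no smoothing; the function names `train_bigram_model` and `bigram_probability` and the sample sentence are illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens: list) -> dict:
    """Map each token to a Counter of the tokens that follow it."""
    following = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        following[prev][cur] += 1
    return following


def bigram_probability(model: dict, prev: str, cur: str) -> float:
    """Estimate P(cur | prev) as count(prev, cur) / count(prev, anything)."""
    counts = model.get(prev)
    if not counts:
        return 0.0   # prev was never seen as a context
    return counts[cur] / sum(counts.values())


tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)
print(bigram_probability(model, "cat", "sat"))   # 0.5
```

Here "cat" is followed once by "sat" and once by "ran", so each continuation gets probability 0.5. A unigram model would assign the same probability to "sat" regardless of the preceding word.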