String-Matching
-
Beider-Morse Phonetic Matching
Beider-Morse Phonetic Matching (BMPM) is a rule-based phonetic algorithm designed for Jewish surnames, applying language-specific phonological rules to match names across Yiddish, Hebrew, Russian, Polish, German, and other languages.
-
Cologne Phonetics
Cologne Phonetics (Kölner Phonetik) is a German phonetic algorithm that maps names to numeric codes, enabling phonetic matching across German spelling variations that Soundex, which was designed for English, handles poorly.
-
Daitch-Mokotoff Soundex
Daitch-Mokotoff Soundex is a Soundex variant developed for Slavic and Yiddish surnames that produces a six-digit numeric code and can return multiple codes per name to handle ambiguous digraph pronunciations.
-
Match Rating Approach
The Match Rating Approach encodes a name into a codex and then compares two codices using a defined similarity rating, returning a boolean match decision rather than leaving comparison to the caller.
-
NYSIIS
NYSIIS (New York State Identification and Intelligence System) is a phonetic encoding algorithm developed in 1970 that maps names to letter-based codes, producing more accurate matches for North American names than Soundex.
-
Caverphone
Caverphone is a phonetic encoding algorithm designed for New Zealand English names; its 2.0 revision produces a fixed 10-character code to match name variants across historical records.
-
Double Metaphone
Double Metaphone extends the original Metaphone algorithm by producing two phonetic codes per word — a primary and a secondary — to handle pronunciation ambiguity and non-English name patterns.
-
Metaphone
Metaphone encodes an English word into a variable-length string of consonant sounds, applying context-sensitive phonological rules that allow names with different spellings but similar pronunciations to match.
-
Metaphone 3
Metaphone 3 is a commercial phonetic algorithm by Lawrence Philips that extends Double Metaphone with a substantially larger rule set, claiming around 98% accuracy on English and European names.
-
Phonetic Encoding
Phonetic encoding maps a word to a compact code that represents its pronunciation, so that words which sound alike but are spelled differently produce the same code and match one another.
-
Soundex
Soundex maps a name to a four-character code — one letter plus three digits — so that names with similar pronunciations but different spellings produce the same code and match one another.
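
As a rough illustration of the encoding described above, here is a minimal Python sketch of the common American Soundex variant. The names `soundex` and `CODES` are chosen here for illustration, and details such as the treatment of H and W differ between implementations.

```python
# Digit assigned to each consonant group; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a four-character code: first letter plus three digits."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of the same digit
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated digit after it is kept
    return (code + "000")[:4]  # pad with zeros, then truncate to four characters


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261
```

Because "Robert" and "Rupert" share a code, an index keyed on the Soundex value would retrieve both for either query.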
-
Lovins Stemmer
The Lovins Stemmer is the earliest published stemming algorithm (1968), reducing English words to stems without iteration: it strips the longest matching ending from a table of 294 endings, each guarded by a context condition, and then applies recoding rules to the remaining stem.
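
Longest-match suffix stripping can be sketched in a few lines of Python. This is only an illustration of the idea: `ENDINGS` is a hypothetical toy table, not the real Lovins list, and the per-ending context conditions and the recoding step are left out. The two-letter minimum stem length is treated here as an assumption.

```python
# Hypothetical toy table for illustration; the real Lovins list is far larger.
ENDINGS = ["ational", "ation", "ness", "ing", "ion", "ed", "s"]
MIN_STEM = 2  # assumed minimum number of letters left after stripping


def strip_longest_ending(word: str) -> str:
    """Remove the longest listed ending that leaves a long enough stem."""
    word = word.lower()
    # Try longer endings first so that the longest match wins.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            return word[: -len(ending)]
    return word  # no ending applies


print(strip_longest_ending("relational"))  # rel
print(strip_longest_ending("nation"))      # nat
print(strip_longest_ending("sing"))        # sing
```

In the second call the longer ending "ation" is rejected because it would leave a one-letter stem, so the shorter "ion" is stripped instead.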
-
Paice/Husk Stemmer
The Paice/Husk Stemmer is an iterative English stemmer using a single compact rule table with a loop-back architecture, producing aggressively short stems at the cost of over-stemming.
-
Porter Stemmer
The Porter Stemmer is a rule-based English suffix-stripping algorithm that reduces words to a stem using five sequential transformation passes gated by a vowel-consonant measure.
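
The vowel-consonant measure, usually written m, counts the vowel-then-consonant sequences in a stem of the form [C](VC)^m[V]. A minimal Python sketch of that count follows; the function names are chosen here for illustration, and the suffix rules that the measure gates are not shown.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel
    only when the letter before it is a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-run to consonant-run transitions, i.e. m in [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m


print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2
```

A rule written with the condition m > 1 fires only when the stem left after removing the suffix has a measure of at least 2, which keeps very short stems from being stripped further.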
-
Porter2 Stemmer
Porter2 is a revised English suffix-stripping algorithm from the Snowball project that fixes around 200 mis-stemmings in the original Porter Stemmer; in Elasticsearch it is available as the porter2 language option of the stemmer token filter.
-
Unicode Normalisation
Unicode normalisation resolves the fact that a single visible character can be encoded multiple ways, standardising text to one of four forms — NFC, NFD, NFKC, or NFKD — before comparison, indexing, or hashing.
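
The effect can be seen with Python's standard `unicodedata` module. The sketch below is illustrative only; `same_text` is a hypothetical helper name, and the default form of NFC is an assumption made for the example.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: identical once normalised

# Compatibility forms also fold presentation variants such as ligatures.
ligature = "\ufb01"  # the single-character "fi" ligature
print(unicodedata.normalize("NFC", ligature) == ligature)  # True: unchanged
print(unicodedata.normalize("NFKC", ligature))             # fi
```

NFC and NFD preserve compatibility characters such as the ligature, while NFKC and NFKD replace them with their plain equivalents, so the choice of form depends on whether such distinctions should survive matching.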