String-Matching

Beider-Morse Phonetic Matching

Beider-Morse Phonetic Matching (BMPM) is a rule-based phonetic algorithm designed for Jewish surnames, applying language-specific phonological rules to match names across Yiddish, Hebrew, Russian, Polish, German, and other languages.
Cologne Phonetics

Cologne Phonetics (Kölner Phonetik) is a German phonetic algorithm that maps names to numeric codes, enabling phonetic matching across German spelling variations that Soundex, which was designed for English, handles poorly.
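As an illustration of the letter-to-digit mapping, here is a minimal Python sketch of the commonly published Cologne Phonetics rule table. The function name cologne_phonetics is hypothetical, and the handling of umlauts, ß, and the letter H (dropped before duplicates are collapsed) is an assumption; implementations differ on these edge cases.

```python
def cologne_phonetics(name: str) -> str:
    """Encode a name with the commonly published Cologne Phonetics rules (sketch)."""
    # Assumption: fold umlauts and sharp s to base letters, drop everything else.
    s = name.replace("ß", "ss").upper()
    s = s.translate(str.maketrans("ÄÖÜ", "AOU"))
    s = "".join(ch for ch in s if "A" <= ch <= "Z")

    raw = []
    for i, ch in enumerate(s):
        prev = s[i - 1] if i > 0 else ""
        nxt = s[i + 1] if i + 1 < len(s) else ""
        if ch in "AEIJOUY":
            code = "0"
        elif ch == "H":
            code = ""  # H carries no code of its own
        elif ch == "B":
            code = "1"
        elif ch == "P":
            code = "3" if nxt == "H" else "1"
        elif ch in "DT":
            code = "8" if nxt in {"C", "S", "Z"} else "2"
        elif ch in "FVW":
            code = "3"
        elif ch in "GKQ":
            code = "4"
        elif ch == "C":
            if i == 0:
                code = "4" if nxt in set("AHKLOQRUX") else "8"
            elif prev in {"S", "Z"}:
                code = "8"
            else:
                code = "4" if nxt in set("AHKOQUX") else "8"
        elif ch == "X":
            code = "8" if prev in {"C", "K", "Q"} else "48"
        elif ch == "L":
            code = "5"
        elif ch in "MN":
            code = "6"
        elif ch == "R":
            code = "7"
        else:  # S or Z
            code = "8"
        raw.append(code)

    digits = "".join(raw)
    if not digits:
        return ""
    # Step 2: collapse runs of the same digit.
    collapsed = digits[0]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed += d
    # Step 3: drop every 0 except a leading one.
    return collapsed[0] + collapsed[1:].replace("0", "")


print(cologne_phonetics("Müller-Lüdenscheidt"))  # 65752682
print(cologne_phonetics("Meyer") == cologne_phonetics("Maier"))  # True
```

Under this sketch, spelling variants such as Meyer and Maier collapse to the same code because vowels are discarded after the first position.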
Daitch-Mokotoff Soundex

Daitch-Mokotoff Soundex is a Soundex variant developed for Slavic and Yiddish surnames that produces a six-digit numeric code and can return multiple codes per name to handle ambiguous digraph pronunciations.
Match Rating Approach

The Match Rating Approach encodes a name into a codex and then compares two codices using a defined similarity rating, returning a boolean match decision rather than leaving comparison to the caller.
NYSIIS

NYSIIS (New York State Identification and Intelligence System) is a phonetic encoding algorithm developed in 1970 that maps names to letter-based codes, producing more accurate matches for North American names than Soundex.
Caverphone

Caverphone is a phonetic encoding algorithm designed for New Zealand English names; its 2.0 revision produces a 10-character code to match name variants across historical records.
Double Metaphone

Double Metaphone extends the original Metaphone algorithm by producing two phonetic codes per word — a primary and a secondary — to handle pronunciation ambiguity and non-English name patterns.
Metaphone

Metaphone encodes an English word into a variable-length string of consonant sounds, applying context-sensitive phonological rules that allow names with different spellings but similar pronunciations to match.
Metaphone 3

Metaphone 3 is a commercial phonetic algorithm by Lawrence Philips that extends Double Metaphone with a substantially larger rule set, claiming around 98% accuracy on English and European names.
Phonetic Encoding

Phonetic encoding maps a word to a compact code that represents its pronunciation, so that words which sound alike but are spelled differently produce the same code and match one another.
Soundex

Soundex maps a name to a four-character code — one letter plus three digits — so that names with similar pronunciations but different spellings produce the same code and match one another.
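As a rough sketch of the encoding described above, the following Python function applies the common American Soundex rules. The function name soundex is chosen for illustration, and treating non-ASCII letters like vowels is an assumption rather than part of any standard.

```python
def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name (sketch)."""
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its code still suppresses a duplicate
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")        # vowels (and unknown letters) map to ""
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets the duplicate check
    return (result + "000")[:4]         # pad or truncate to four characters


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Ashcraft"))                   # A261
```

Robert and Rupert share the code R163, which is the matching behaviour the definition describes.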
Lovins Stemmer

The Lovins Stemmer is the earliest published stemming algorithm (1968), reducing English words to stems in a single pass by stripping the longest matching suffix from a table of 294 endings.
Paice/Husk Stemmer

The Paice/Husk Stemmer is an iterative English stemmer using a single compact rule table with a loop-back architecture, producing aggressively short stems at the cost of over-stemming.
Porter Stemmer

The Porter Stemmer is a rule-based English suffix-stripping algorithm that reduces words to a stem using five sequential transformation passes gated by a vowel-consonant measure.
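The vowel-consonant measure can be sketched in a few lines of Python. This covers only the measure m, the number of vowel-to-consonant transitions in a stem of the form [C](VC)^m[V], and not the five suffix-stripping passes themselves. The helper names is_consonant and measure are illustrative.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y at the start, or after a vowel, counts as a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-run to consonant-run transitions (the m in [C](VC)^m[V])."""
    stem = stem.lower()
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_was_vowel = not consonant
    return m


print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
```

Rules in the stemmer are typically conditioned on this value, for example applying a suffix removal only when the remaining stem has m greater than 0 or greater than 1.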
Porter2 Stemmer

Porter2 is a revised English suffix-stripping algorithm from the Snowball project that fixes around 200 mis-stemmings in the original Porter Stemmer; Elasticsearch exposes it as the porter2 language of its stemmer token filter.
Unicode Normalisation

Unicode normalisation resolves the fact that a single visible character can be encoded multiple ways, standardising text to one of four forms — NFC, NFD, NFKC, or NFKD — before comparison, indexing, or hashing.
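A short Python sketch using the standard library's unicodedata module shows why normalising before comparison matters. The helper name same_text is hypothetical. The sample strings are "café" written with a precomposed é and with e plus a combining acute accent.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "caf\u00e9"      # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by combining acute accent (U+0301)

print(composed == decomposed)            # False: different code point sequences
print(same_text(composed, decomposed))   # True: identical after NFC

# Compatibility forms additionally fold characters such as the "fi" ligature.
print(same_text("\ufb01le", "file", "NFC"))   # False
print(same_text("\ufb01le", "file", "NFKC"))  # True
```

NFC and NFD preserve the distinction between compatibility characters and their plain equivalents, while NFKC and NFKD fold them together. That is usually desirable for search and matching but loses information, so it is less suitable for storage.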