Daitch-Mokotoff Soundex
What it is
Daitch-Mokotoff Soundex (DM Soundex) is a phonetic encoding algorithm designed to match surnames that sound alike despite spelling variation. It was developed by Gary Mokotoff and Randy Daitch in 1985 and published in Avotaynu: The International Review of Jewish Genealogy.
The algorithm targets a specific and difficult domain: Central and Eastern European surnames (Polish, Russian, Czech, Hungarian), including Yiddish and Ashkenazi Jewish surnames, that have been transliterated into Latin script through multiple national conventions. American Soundex, designed for English names, performs poorly on this material. DM Soundex replaces its coarse consonant table and short four-character code with a more refined system tuned to Central and Eastern European phonology.
How it works
DM Soundex encodes a name into a six-digit numeric string. Unlike American Soundex, the output contains no leading letter — the entire name, including the first sound, is encoded into digits.
The encoding process:
- Work through the name character by character, left to right.
- At each position, check for multi-character digraphs (cz, sz, rz, dz, ch, etc.) before single characters — the longest match wins.
- Map the character or digraph to a digit using the DM coding table.
- Apply the adjacent-duplicate rule: if the same digit would be produced twice in succession (and no vowel separates the consonant groups), record it only once.
- Initial vowels map to digit 0; subsequent vowels are used only to break consonant groups, not coded as digits themselves.
- Pad with trailing zeros to six digits, or truncate at six digits if the name is long.
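The longest-match step can be sketched as a small tokenizer. This is a minimal illustration, not a reference implementation: the `PATTERNS` list and the `tokenize` function are hypothetical names, and the list holds only a handful of multi-character patterns, whereas the published table contains many more.

```python
# Hypothetical, deliberately short pattern list; the published DM table is far larger.
PATTERNS = ["schtsch", "tch", "sch", "cz", "sz", "rz", "dz", "ch", "tz"]


def tokenize(name, patterns=PATTERNS):
    """Split a name into tokens, preferring the longest pattern at each position."""
    name = name.lower()
    by_length = sorted(patterns, key=len, reverse=True)  # longest patterns first
    tokens = []
    i = 0
    while i < len(name):
        for pattern in by_length:
            if name.startswith(pattern, i):
                tokens.append(pattern)
                i += len(pattern)
                break
        else:
            # No multi-character pattern matched here: take a single character.
            tokens.append(name[i])
            i += 1
    return tokens


print(tokenize("Moskowitz"))   # ['m', 'o', 's', 'k', 'o', 'w', 'i', 'tz']
print(tokenize("Moskovitch"))  # ['m', 'o', 's', 'k', 'o', 'v', 'i', 'tch']
```

Sorting the patterns by length before matching is what guarantees that "tch" is tried before "ch" and "tz" before "t".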
The coding groups (simplified):
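| Digit | Characters encoded |
|---|---|
| 0 | A, E, I, O, U (initial position only) |
| 1 | J (when read as Y), initial Y |
| 2 | ST, SZT and similar S+T clusters at the start of a name |
| 3 | D, T, TH |
| 4 | S, Z, SZ, CZ, TS, TZ, SCH, TCH |
| 5 | K, G, Q, KH; H before a vowel |
| 6 | M, N |
| 7 | B, P, F, V, W |
| 8 | L |
| 9 | R |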
The actual DM table is considerably more detailed: Polish digraphs, transliterated German umlauts, and Yiddish phonemes each have explicit entries. The table is the heart of the algorithm, and implementations should use the published reference table rather than the simplified groups above.
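For names that contain no ambiguous or context-sensitive letters, the whole process fits in a short function. The sketch below is hedged accordingly: `PARTIAL_TABLE` and `encode_simple` are hypothetical names, the table is a hand-picked subset of the published one, and rules such as the special handling of C, CH, H, J, RZ, ST and MN are left out. Letters outside the subset raise an error rather than producing a silently wrong code.

```python
# Hand-picked subset of the published DM table, for illustration only.
# Ambiguous and context-sensitive entries (C, CH, H, J, RZ, ST, MN, ...) are omitted.
PARTIAL_TABLE = {
    "tch": "4", "sch": "4", "cz": "4", "sz": "4", "tz": "4", "ts": "4",
    "s": "4", "z": "4",
    "d": "3", "t": "3",
    "k": "5", "g": "5", "q": "5",
    "m": "6", "n": "6",
    "b": "7", "p": "7", "f": "7", "v": "7", "w": "7",
    "l": "8",
    "r": "9",
}
VOWELS = set("aeiou")


def encode_simple(name):
    """Encode a name with the partial table; returns one six-digit code."""
    name = name.lower()
    patterns = sorted(PARTIAL_TABLE, key=len, reverse=True)  # longest match first
    digits = []
    previous = None  # code of the preceding consonant group; vowels reset it
    i = 0
    while i < len(name):
        if name[i] in VOWELS:
            if i == 0:
                digits.append("0")  # only an initial vowel is coded
            previous = None         # a vowel separates consonant groups
            i += 1
            continue
        for pattern in patterns:
            if name.startswith(pattern, i):
                code = PARTIAL_TABLE[pattern]
                if code != previous:  # adjacent-duplicate rule
                    digits.append(code)
                previous = code
                i += len(pattern)
                break
        else:
            raise ValueError(f"{name[i]!r} is not covered by this partial table")
    return "".join(digits)[:6].ljust(6, "0")  # truncate or pad to six digits


print(encode_simple("Moskowitz"))   # 645740
print(encode_simple("Moskovitch"))  # 645740
```

A production implementation should load the full published table instead of this subset.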
[illustrate: step-by-step DM Soundex encoding of “Moskowitz”: show the name split into character groups (M, O, S, K, O, W, I, TZ), each mapped to its digit in a two-row table (input / code), adjacent duplicates collapsed, result padded to six digits]
Multiple codes per name
This is the most distinctive feature of DM Soundex. Certain letters and digraphs are phonetically ambiguous: “CH”, for example, may stand for a guttural sound (as in German or Polish spellings) or for the affricate /tʃ/ (as in English renderings). The published table therefore codes CH either as 5 (like KH) or as 4 (like TCH). When the algorithm encounters such a pattern, it branches, producing two (or more) codes rather than choosing one arbitrarily.
A name containing two such digraphs can theoretically produce four codes. In practice, most names yield one or two.
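Branching amounts to taking the Cartesian product of the alternatives at each position. The sketch below assumes the name has already been tokenised and each token mapped to its list of candidate codes; `combine` is a hypothetical helper, an empty string stands for an uncoded letter that separates consonant groups, and the digits for “Cohn” assume that an initial C is coded as 5 or 4 and that H before a consonant is not coded.

```python
from itertools import product


def combine(token_codes):
    """Expand per-token alternatives into the list of distinct six-digit codes."""
    results = []
    for path in product(*token_codes):
        digits = []
        previous = None
        for code in path:
            if code == "":
                previous = None      # uncoded letter: breaks a duplicate run
                continue
            if code != previous:     # adjacent-duplicate rule
                digits.append(code)
            previous = code
        full = "".join(digits)[:6].ljust(6, "0")
        if full not in results:      # different paths may collapse to one code
            results.append(full)
    return results


# "Cohn": C -> 5 or 4 (assumed), O and H uncoded, N -> 6
cohn = [["5", "4"], [""], [""], ["6"]]
print(combine(cohn))  # ['560000', '460000']
```

With two ambiguous positions the product has four paths, which is where the theoretical maximum of four codes comes from; paths that collapse to the same digits are reported once.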
[illustrate: branching diagram for encoding “Cohn”: single input splits at the ambiguous initial C into two parallel encoding paths, each producing a distinct six-digit code; both codes shown as leaf nodes]
Example
Take the surnames Moskowitz and Moskovitch — variant spellings of the same Eastern European surname, one a German transliteration and one an English rendering of a Slavic original.
Encoding “Moskowitz”:
M O S K O W I TZ
6 - 4 5 - 7 - 4
In the published DM table, S (4) and K (5) fall in different groups, so nothing is collapsed here, and TZ is treated as a single unit coded 4. Vowels after the first position are not coded. The five digits 64574 are padded with a trailing zero, giving 645740.
“Moskovitch” encodes the same way: V shares group 7 with W, and TCH shares group 4 with TZ, so it also yields 645740. A database query on either spelling therefore retrieves records filed under the other.
[illustrate: side-by-side before/after showing raw spellings “Moskowitz” and “Moskovitch” on the left, their DM Soundex code sets on the right, with the shared code highlighted to show the match]
Variants and history
DM Soundex was published in 1985 as a direct response to the inadequacy of American Soundex for Jewish genealogy research. The algorithm was adopted rapidly by genealogical databases serving Ashkenazi Jewish records — JRI-Poland, MyHeritage, and others index surnames using DM Soundex codes to enable cross-spelling search.
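The closest successor is Beider-Morse Phonetic Matching, which was designed to reduce the false positives that DM Soundex produces, but DM Soundex remains in wide use in genealogical indexes. Double Metaphone and Metaphone 3 handle some of the same phonological territory but are not specifically calibrated to Slavic and Yiddish morphology, and neither stores multiple codes in the same structured way for genealogical database indexing.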
When to use it
Use DM Soundex when your data contains Central or Eastern European surnames — particularly in genealogy, archival, or immigration records where the same name may appear in Polish, Russian, German, Yiddish, and anglicised forms in different documents.
The multiple-codes feature requires that your data layer store and query against arrays of codes rather than a single code per record. When indexing, store all codes a name produces. When querying, generate all codes for the query name and perform an OR match.
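The store-all, match-any pattern can be illustrated with an in-memory inverted index. `CodeIndex` is a hypothetical class, not part of any library, and the codes are written as literals that would normally come from a DM Soundex encoder; the values for “Cohn” and “Kohn” assume the published table's treatment of C and K.

```python
from collections import defaultdict


class CodeIndex:
    """Maps each phonetic code to the set of record ids filed under it."""

    def __init__(self):
        self._by_code = defaultdict(set)

    def add(self, record_id, codes):
        # Index the record under every code its name produces.
        for code in codes:
            self._by_code[code].add(record_id)

    def search(self, query_codes):
        # OR match: a record is a hit if it shares any code with the query.
        hits = set()
        for code in query_codes:
            hits |= self._by_code.get(code, set())
        return hits


index = CodeIndex()
index.add(1, ["560000", "460000"])  # e.g. "Cohn", two codes
index.add(2, ["645740"])            # e.g. "Moskowitz", one code

print(index.search(["560000"]))     # query for "Kohn" finds record 1
```

In a relational database the same idea becomes a separate codes table with one row per (record, code) pair, queried with an `IN` list of the query name's codes.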
Tradeoffs:
- More precise than American Soundex for Slavic names; produces fewer false positives due to six-digit length.
- More complex to implement correctly — the digraph table must be applied in priority order.
- Multiple codes increase index size but are essential for recall in ambiguous cases.
- For general English-language names, American Soundex or Metaphone remain simpler and adequate.
- For multilingual name matching beyond Slavic and Yiddish, consider Double Metaphone or Metaphone 3, which cover a broader phonological range.
Elasticsearch supports DM Soundex through the analysis-phonetic plugin, which must be installed separately. A token filter definition looks like this:
    {
      "filter": {
        "dm_soundex_filter": {
          "type": "phonetic",
          "encoder": "daitch_mokotoff",
          "replace": false
        }
      }
    }
Setting "replace": false retains the original token alongside the phonetic code, which is the recommended configuration for search.
Python: the jellyfish library does not include DM Soundex. Use a dedicated genealogy library and treat the return value as a list, even for unambiguous names, because any name may yield more than one code:
    # Pseudocode: dm_soundex is a placeholder, verify your library's API before use
    codes = dm_soundex("Moskovitch")
    # Expect a list of six-digit codes, e.g. ["645740"]; ambiguous names yield several
Always store and query all codes in the list; discarding any code will silently reduce recall.