Soundex

What it is

Soundex is the original phonetic encoding algorithm, published by Robert C. Russell and Margaret King Odell in 1918. It was designed specifically for English surnames and was adopted by the US Census Bureau to group name variants in census and genealogy records — where the same family might appear as "Smith", "Smyth", and "Smithe" across different documents depending on how a clerk heard the name.

Every input produces a four-character code: the initial letter of the name preserved verbatim, followed by exactly three digits derived from the consonants that follow it. "Robert" becomes R163; "Rupert" also becomes R163. The collision is intentional — the algorithm treats them as phonetically equivalent.

The American Soundex, standardised by the National Archives and Records Administration (NARA), is the canonical version. It is available natively in most relational databases, which is the primary reason it remains in production use more than a century after its invention.

How it works

The algorithm applies five deterministic steps to any input name.

Step 1 — Retain the first letter. Copy the first character of the name verbatim into the output. It is never replaced by a digit, even if it would otherwise map to one, but its digit class still counts when duplicates are collapsed in Step 3.

Step 2 — Map remaining consonants to digit classes. Apply the following table to every character after the first:

Digit    Letters
1        B, F, P, V
2        C, G, J, K, Q, S, X, Z
3        D, T
4        L
5        M, N
6        R
(drop)   A, E, I, O, U, H, W, Y

Vowels, H, W, and Y produce no digit of their own. Vowels and Y still matter in Step 3, however, because they separate the consonants on either side of them.

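Step 3 — Remove adjacent duplicate digits. If two consecutive letters map to the same digit, keep only one. This collapses doubled consonants and adjacent same-class consonants: in "Jackson", C, K, and S all map to 2 and collapse to a single digit, giving J250. The rule also covers the first letter: in "Lloyd" the second L shares class 4 with the initial L and is dropped, so the code is L300.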

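Two subtleties govern adjacency. H and W are transparent: if two consonants from the same class are separated only by an H or W, they still count as adjacent and collapse to a single digit. "Ashcraft" therefore encodes as A261 rather than A226, because the H between S and C is invisible to the algorithm. Vowels and Y are not transparent: same-class consonants separated by a vowel are each encoded, so "Tymczak" becomes T522, with the Z absorbed by the C before it and the K encoded again after the A.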

Step 4 — Remove all zeros. If dropped characters were encoded explicitly as 0 during mapping, remove those zeros now. In practice, steps 2 and 4 are often merged by skipping vowels, H, W, and Y during encoding instead of emitting 0 and stripping it later. A merged implementation must still remember that a vowel or Y resets the previous digit for the purposes of Step 3, whereas H and W do not.

Step 5 — Pad or truncate to exactly four characters. The output must be exactly four characters: the initial letter plus three digits. If fewer than three digits were produced, append trailing zeros. If more than three digits were produced, truncate after the third.
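The five steps can be sketched in a few lines of Python. This is a minimal illustration of American Soundex as described above, not a reference implementation: the names soundex and CODES are hypothetical, non-ASCII letters are simply ignored, and an input with no letters returns an empty string, all of which are assumptions made for this sketch.

```python
# Digit classes from the table in Step 2; letters absent from CODES produce no digit.
CODES = {
    ch: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
    }.items()
    for ch in letters
}


def soundex(name: str) -> str:
    """Return a four-character American Soundex code (sketch, ASCII letters only)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""  # assumption: no letters means no code

    first = letters[0]                # Step 1: keep the first letter verbatim
    prev = CODES.get(first, "")       # its class still takes part in Step 3
    digits = []
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # transparent: previous digit is unchanged
        code = CODES.get(ch, "")      # Step 2 (vowels and Y give "")
        if code and code != prev:     # Step 3: skip adjacent duplicates
            digits.append(code)
        prev = code                   # a vowel or Y resets prev to ""
    # Steps 4 and 5: no zeros were emitted; pad with zeros, cut to four characters
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for n in ("Robert", "Rupert", "Smith", "Schmidt", "Ashcraft", "Tymczak", "Lloyd"):
        print(n, soundex(n))
```

Merging steps 2 and 4 keeps the loop short; the only state carried between letters is the previous digit, which is what makes the H/W and vowel rules easy to get wrong.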

Example

Robert and Rupert — intended collision

Robert
R        → keep (first letter)
o        → drop (vowel)
b        → 1
e        → drop (vowel)
r        → 6
t        → 3

Digit sequence: 1, 6, 3 → no duplicates → R163

Rupert
R        → keep
u        → drop
p        → 1
e        → drop
r        → 6
t        → 3

Digit sequence: 1, 6, 3 → R163

Both names encode to R163. The algorithm judges them phonetically equivalent.

Smith and Schmidt — known collision

Smith
S        → keep
m        → 5
i        → drop
t        → 3
h        → drop

Digit sequence: 5, 3 → pad → S530

Schmidt
S        → keep
c        → 2 (same class as initial S, so drop)
h        → transparent (drop)
m        → 5
i        → drop
d        → 3
t        → 3 (duplicate of previous, so drop)

Digit sequence: 5, 3 → pad → S530

"Smith" and "Schmidt" both produce S530. This is a genuine limitation: the two names have different phonological origins, but Soundex’s coarse groupings cannot distinguish them.

Ashcraft

Ashcraft
A        → keep
s        → 2
h        → transparent
c        → 2 (same class as s — duplicate, drop)
r        → 6
a        → drop
f        → 1
t        → 3 (fourth digit — truncate)

Digit sequence: 2, 6, 1 → A261

Variants and history

Russell and Odell originally filed a US patent for the algorithm (US Patent 1,261,167, granted 1918). The version adopted by the US Census Bureau — the American Soundex — differs in minor respects from the original patent and is the de facto standard documented by NARA.

The most significant descendant is Daitch–Mokotoff Soundex (1985), developed to handle Eastern European Jewish surnames — Yiddish, Polish, and Russian names that the American Soundex encodes poorly or conflates. Daitch–Mokotoff uses a six-digit code, applies more refined consonant-class mappings, and generates multiple codes per input to capture alternative pronunciations. It is available as an option in some genealogy software and in the Elasticsearch analysis-phonetic plugin ("encoder": "daitch_mokotoff").

Other later algorithms — Metaphone, Double Metaphone, NYSIIS, Caverphone — each addressed specific deficiencies of Soundex. See Phonetic Encoding for an overview of the full family.

When to use it

Use Soundex when:

  • You need phonetic matching and your data store already provides it. MySQL and SQL Server expose a built-in SOUNDEX() function, and PostgreSQL ships soundex() in its fuzzystrmatch extension. If you are already in SQL and need approximate name matching with no additional infrastructure, Soundex costs almost nothing to add.
  • Your corpus consists primarily of English surnames in a genealogical or historical records context, which is exactly what the algorithm was designed for.
  • Recall matters more than precision. Soundex casts a wide net — it will surface more matches than a stricter algorithm, at the cost of more false positives.
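The wide-net behaviour in the last bullet is easy to see by grouping a handful of names by their code. The sketch below is illustrative only: soundex and bucket_by_code are hypothetical helper names, and the encoder is a compact ASCII-only version of the steps described earlier.

```python
from collections import defaultdict

# Letter classes from the Soundex table, flattened to a letter -> digit lookup.
_CLASSES = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODE = {ch: digit for letters, digit in _CLASSES.items() for ch in letters}


def soundex(name):
    """Compact American Soundex sketch (ASCII letters only)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    prev = _CODE.get(letters[0], "")
    out = letters[0]
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not break adjacency
        code = _CODE.get(ch, "")
        if code and code != prev:
            out += code
        prev = code                  # vowels and Y reset the previous digit
    return (out + "000")[:4]


def bucket_by_code(names):
    """Group names that share a Soundex code, preserving input order."""
    buckets = defaultdict(list)
    for name in names:
        buckets[soundex(name)].append(name)
    return dict(buckets)


names = ["Smith", "Smyth", "Smithe", "Schmidt", "Robert", "Rupert", "Ashcraft"]
groups = bucket_by_code(names)
for code, members in groups.items():
    print(code, members)
```

Every spelling variant of "Smith" lands in one bucket, which is the desired recall, but "Schmidt" lands there too, which is the precision cost.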

Prefer a different algorithm when:

  • Your name corpus includes non-English names. Soundex’s consonant-class mappings reflect English phonology and produce unreliable codes for Arabic transliterations, East Asian romanisations, or Slavic names.
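  • You need to distinguish names that differ only in vowels or same-class consonants ("Robert" vs "Rupert") or that collide despite different phonological origins ("Smith" vs "Schmidt"). Finer-grained encoders such as Double Metaphone separate more such pairs, though not necessarily these two; test against your own data.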
  • You are matching arbitrary text rather than proper names. Soundex was not designed for common nouns or full-text fields; applying it there produces noisy, low-precision results.
  • You are working in Elasticsearch or OpenSearch and have access to the analysis-phonetic plugin. In that environment, Double Metaphone and NYSIIS are straightforward alternatives and typically give better precision at similar recall.

Elasticsearch and OpenSearch — configuration. Install the analysis-phonetic plugin, then define a token filter with "type": "phonetic" and "encoder": "soundex". Apply it inside a custom analyser on name fields only.

{
  "settings": {
    "analysis": {
      "filter": {
        "soundex_filter": {
          "type": "phonetic",
          "encoder": "soundex",
          "replace": true
        }
      },
      "analyzer": {
        "soundex_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "soundex_filter"]
        }
      }
    }
  }
}

Setting "replace": true stores only the Soundex code in the index and discards the original token. Set it to false to index both the original token and its code in the same field. For a multi-field strategy, where the raw name field handles exact matches and a separate phonetic field handles fuzzy ones, the phonetic field can keep "replace": true.
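One way to realise the multi-field strategy is to map the raw name as ordinary text and attach a phonetic sub-field that uses the analyser defined above. The sketch below builds the request bodies as Python dictionaries and prints them as JSON. The field name "name", the sub-field name "phonetic", the helper name_query, and the boost value are all assumptions chosen for illustration.

```python
import json

# Index body: settings repeat the filter and analyser shown above;
# the mapping adds a hypothetical "name" field with a "phonetic" sub-field.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "soundex_filter": {
                    "type": "phonetic",
                    "encoder": "soundex",
                    "replace": True,  # sub-field stores codes only
                }
            },
            "analyzer": {
                "soundex_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "soundex_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",  # raw tokens, used for exact matches
                "fields": {
                    "phonetic": {"type": "text", "analyzer": "soundex_analyzer"}
                },
            }
        }
    },
}


def name_query(text):
    """Search both fields; exact hits are boosted above phonetic-only hits."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"name": {"query": text, "boost": 2.0}}},
                    {"match": {"name.phonetic": {"query": text}}},
                ]
            }
        }
    }


print(json.dumps(index_body, indent=2))
print(json.dumps(name_query("Schmidt"), indent=2))
```

Keeping the two fields separate lets exact matches rank first while phonetic matches still appear lower in the results.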

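SQL — native availability. MySQL, MariaDB, and SQL Server expose SOUNDEX() as a built-in function. PostgreSQL provides soundex() through the fuzzystrmatch extension, which must be enabled with CREATE EXTENSION fuzzystrmatch. Implementations differ in detail (MySQL, for instance, can return codes longer than four characters), so compare only codes produced by the same engine: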

-- MySQL / MariaDB
SELECT * FROM people WHERE SOUNDEX(name) = SOUNDEX('Schmidt');

-- PostgreSQL (requires fuzzystrmatch extension)
SELECT * FROM people WHERE soundex(name) = soundex('Schmidt');

-- SQL Server
SELECT * FROM people WHERE SOUNDEX(name) = SOUNDEX('Schmidt');

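The query pattern, which encodes both sides of the comparison, lets the planner use an index on the encoded value if one exists, for example an expression index on soundex(name) in PostgreSQL or an indexed generated or computed column in MySQL or SQL Server. Without such an index, the function is evaluated for every row.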
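Where a functional index is not available, a common alternative is to store the code in its own indexed column at write time. The sketch below shows that pattern with Python's built-in sqlite3 module and an in-memory database. It swaps the database's own SOUNDEX() for a hypothetical Python-side soundex helper, and the table and column names (people, name, name_sx) are assumptions.

```python
import sqlite3

_CLASSES = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODE = {ch: digit for letters, digit in _CLASSES.items() for ch in letters}


def soundex(name):
    """Compact American Soundex sketch (ASCII letters only)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    prev = _CODE.get(letters[0], "")
    out = letters[0]
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # transparent letters
        code = _CODE.get(ch, "")
        if code and code != prev:
            out += code
        prev = code                  # vowels and Y reset the previous digit
    return (out + "000")[:4]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, name_sx TEXT)")
conn.execute("CREATE INDEX people_name_sx ON people (name_sx)")

# Encode once at insert time so lookups compare plain indexed strings.
rows = ["Smith", "Schmidt", "Robert", "Ashcraft"]
conn.executemany(
    "INSERT INTO people (name, name_sx) VALUES (?, ?)",
    [(n, soundex(n)) for n in rows],
)

# Encode the search term the same way and match on the stored code.
matches = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM people WHERE name_sx = ? ORDER BY name",
        (soundex("Smyth"),),
    )
]
print(matches)
```

The trade-off is an extra column to keep in sync with the name, in exchange for lookups that never call the encoder per row.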

See also