Caverphone

What it is

Caverphone is a phonetic encoding algorithm developed by David Hood at the University of Otago, New Zealand, in 2002. It was created for the Caversham Project, a historical study of Caversham and the surrounding southern suburbs of Dunedin, to help match names across late 19th- and early 20th-century electoral rolls, where names were often recorded by ear and spelling is unreliable.

The algorithm’s name is a portmanteau of the project name and “phone” (sound). Its central design goal sets it apart from algorithms like Soundex or Double Metaphone: rather than encoding general English phonetics, the original version was tuned to the accents and surnames found in the study area, southern Dunedin. The 2004 revision was reworked to behave more sensibly on English names in general.

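Caverphone 2 produces a 10-character code for every input (the original Caverphone produces 6 characters): longer results are truncated and shorter ones are padded with trailing 1s. Unlike Soundex’s letter-plus-digits format or Metaphone’s variable-length consonant string, a Caverphone code is a fixed-width string of uppercase letters followed by 1s, produced through an ordered rule-table substitution process.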

How it works

The algorithm works in two phases: a substitution phase that rewrites the input string using ordered letter-sequence rules, followed by a coding phase that maps the result to the final fixed-width code.

In the substitution phase, a sequence of find-and-replace rules is applied in strict order, each rule to the whole string, so later rules operate on the output of earlier ones. A representative sample:

Irregular openings are rewritten when the name starts with them: cough → cou2f, rough → rou2f, tough → tou2f, enough → enou2f (the 2 marks a silent letter; the f is kept)
Any other gh becomes the silent-letter placeholder: gh → 22 (between vowels it is coded as k instead)
w and wh are kept as W only when a vowel follows; otherwise w becomes 2
A trailing e is removed (in Caverphone 2)
h is replaced by 2 everywhere except at the start of the name, where it is coded like an initial vowel

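After the substitutions, an initial vowel is kept as A and, in Caverphone 2, a vowel sound left at the end of the name also becomes A; all other vowels are dropped along with the placeholder digits. The result is padded with trailing 1s and cut to exactly 10 characters (6 in Caverphone 1).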
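The two phases can be sketched as a data-driven rule table. This is a minimal illustration that follows the Caverphone 2 rule order as it is commonly published and implemented; the names `RULES` and `caverphone2_sketch` are made up for this example, and the output should be checked against a reference implementation before it is used on real data.

```python
import re

# Ordered (pattern, replacement) pairs. Each rule is applied to the whole
# string before the next one runs. "2" marks a silent letter, "3" a vowel.
RULES = [
    (r"e$", ""),                                    # drop a trailing e
    (r"^cough", "cou2f"), (r"^rough", "rou2f"),     # irregular openings
    (r"^tough", "tou2f"), (r"^enough", "enou2f"),
    (r"^trough", "trou2f"), (r"^gn", "2n"),
    (r"mb$", "m2"),                                 # silent final b
    ("cq", "2q"), ("ci", "si"), ("ce", "se"), ("cy", "sy"), ("tch", "2ch"),
    ("c", "k"), ("q", "k"), ("x", "k"), ("v", "f"), ("dg", "2g"),
    ("tio", "sio"), ("tia", "sia"), ("d", "t"), ("ph", "fh"), ("b", "p"),
    ("sh", "s2"), ("z", "s"),
    (r"^[aeiou]", "A"), (r"[aeiou]", "3"),          # initial vowel kept as A
    ("j", "y"), (r"^y3", "Y3"), (r"^y", "A"), ("y", "3"),
    ("3gh3", "3kh3"), ("gh", "22"), ("g", "k"),
    (r"s+", "S"), (r"t+", "T"), (r"p+", "P"), (r"k+", "K"),  # collapse runs
    (r"f+", "F"), (r"m+", "M"), (r"n+", "N"),
    ("w3", "W3"), ("wh3", "Wh3"), (r"w$", "3"), ("w", "2"),
    (r"^h", "A"), ("h", "2"),
    ("r3", "R3"), (r"r$", "3"), ("r", "2"),
    ("l3", "L3"), (r"l$", "3"), ("l", "2"),
    ("2", ""), (r"3$", "A"), ("3", ""),             # strip the placeholders
]


def caverphone2_sketch(name: str) -> str:
    """Return a 10-character code for `name` using the rule table above."""
    # Substitution phase: lowercase, keep a-z only, then apply rules in order.
    s = re.sub(r"[^a-z]", "", name.lower())
    for pattern, replacement in RULES:
        s = re.sub(pattern, replacement, s)
    # Coding phase: pad with 1s and cut to the fixed width.
    return (s + "1" * 10)[:10]


print(caverphone2_sketch("Thompson"))  # TMPSN11111
print(caverphone2_sketch("Leigh"))     # LA11111111
```

Keeping the rules in a list rather than a chain of statements makes the order dependence explicit: moving a single pair changes the codes that come out.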

Example

Because the substitution table is long and order-dependent, the safest way to inspect Caverphone output is through a library:

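# Assumes the abydos package, which provides a Caverphone class;
# jellyfish does not ship a Caverphone encoder.
from abydos.phonetic import Caverphone

cp2 = Caverphone(version=2)

cp2.encode("Lee")        # "LA11111111"
cp2.encode("Leigh")      # "LA11111111"  (same code: match)
cp2.encode("Thompson")   # "TMPSN11111"
cp2.encode("Thomson")    # "TMSN111111"  (the p is kept, so no match)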

"Lee" and "Leigh" collapse to the same code despite their different spellings, which is exactly the behaviour needed when indexing a 19th-century New Zealand electoral roll.
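For indexing, the usual pattern is to group records by code and treat each group as a set of candidate matches. The helper below is a hypothetical sketch, not part of any library: `block_by_code` is an invented name, and `encode` stands for whatever Caverphone encoder you have available.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def block_by_code(
    names: Iterable[str], encode: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group names by phonetic code, keeping codes shared by 2+ spellings."""
    blocks: Dict[str, set] = defaultdict(set)
    for name in names:
        blocks[encode(name)].add(name)
    # A code with a single spelling has nothing to be matched against.
    return {
        code: sorted(spellings)
        for code, spellings in blocks.items()
        if len(spellings) > 1
    }
```

Each returned group still needs review, since a shared code only says that two spellings may sound alike, not that they refer to the same person.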

Variants and history

Caverphone 2 (2004), also by Hood, is a revised version with updated substitution rules that improve accuracy on certain name patterns. The output format changed as well: the code grew from 6 characters to 10, and the revised rule table removes a trailing e and codes a final vowel sound as A. Most libraries and search platforms implement Caverphone 2 by default; Caverphone 1 is retained for reproducibility with older indexed data.

When to use it

Caverphone is a good fit when your data resembles the material it was built for: historical New Zealand records, and genealogy applications more generally, where predominantly English-language names vary in spelling from one transcriber to the next. Its rules were tuned on that kind of data, which can give it an edge there over general-purpose algorithms.

Outside this domain, the algorithm offers little benefit over Double Metaphone, which covers a broader range of name origins and is available in more search platforms. If your corpus is mixed — European names alongside NZ/Pacific names — Double Metaphone is the safer default.

Elasticsearch / OpenSearch (via the analysis-phonetic plugin):

{
  "filter": {
    "my_caverphone": {
      "type": "phonetic",
      "encoder": "caverphone2"
    }
  }
}

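Python (abydos):

# Assumes the abydos package; jellyfish does not provide Caverphone.
from abydos.phonetic import Caverphone

Caverphone(version=1).encode("Katherine")  # 6-character Caverphone 1 code
Caverphone(version=2).encode("Katherine")  # 10-character Caverphone 2 code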