I objected to deprecating metaphone(). Then I read the RFC.

A while back a proposal landed on the PHP internals list: deprecate metaphone(). My first reaction was the reflex of someone who has spent years as a PHP release master: leave the old string functions alone, because people depend on them and deprecation churn is its own tax. I was ready to argue against it.

Then I read the reasoning and went looking at what phonetic name matching is actually supposed to do in 2026. I changed my mind. metaphone() should go. What surprised me was not that the function is dated. It was how far the field had moved while PHP core stood still, and how weak the replacement path the RFC points at really is. So I built the replacement I wish it had recommended.

metaphone() is the oldest, least accurate version of an idea that kept evolving

The RFC is right, and here is the short version of why. The metaphone() in core is the original 1990 algorithm: English-only, single-key, tuned for one accent of one language. It was superseded twice. First by Double Metaphone (Lawrence Philips, 2000), which emits a primary and an alternate key so ambiguous pronunciations still match. Then by Metaphone 3, which corrects hundreds more edge cases.

Core shipped the first version and never moved. soundex() is older still, a 1918 patent. Both encode a single English-centric key, and both fall over the moment a name crosses a language boundary or carries a transliteration variant. For the one job these functions exist to do, collapsing names that sound alike but are spelled differently, they are the weakest tools in the drawer.
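
To make the single-key limitation concrete, here is a minimal sketch that uses only the two core functions. The name pairs are my own illustrative picks (the first is one of the collisions listed in the PHP manual for soundex()), and nothing here touches the extension.

```php
<?php
// Core encoders return exactly one string key per name. There is no
// alternate key and no set of codes to fall back on when that key differs.
$pairs = [
    ['Euler', 'Ellery'],         // share a soundex() key (PHP manual example)
    ['Moskowitz', 'Moskovitz'],  // one surname, two transliterations
];

foreach ($pairs as [$a, $b]) {
    printf(
        "%-10s %-10s soundex: %-5s metaphone: %s\n",
        $a,
        $b,
        soundex($a) === soundex($b) ? 'match' : 'miss',
        metaphone($a) === metaphone($b) ? 'match' : 'miss'
    );
}
```

Run it against your own problem names before deciding how much the single key is costing you.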

So the deprecation is defensible. Maintaining the least accurate member of a whole family of algorithms, in C, in core, when most applications can do better elsewhere, is not a good use of anyone's time. I came around to that part.

"Use a userland library" is the wrong replacement

Here is where I part ways with the RFC. Its answer to "what do I use instead" is that there are actively maintained Composer libraries implementing Double Metaphone. That is true, and it is also the wrong instinct.

Phonetic encoding is a hot inner-loop operation. You run it over every name in a dataset, sometimes millions of them, to build a match index you can query later. A pure-PHP implementation pays interpreter overhead on every character of every name. The honest replacement for a native C string function is another native C string function, not a userland reimplementation that is correct but an order of magnitude slower. That performance gap is the entire reason the code lived in core to begin with.
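
If you want to measure that gap on your own data rather than take my word for it, a rough harness is enough. This is a sketch under assumptions: names_per_second() is a hypothetical helper, the corpus is a placeholder, and core metaphone() stands in for whichever native or userland encoder you want to compare.

```php
<?php
/**
 * Rough throughput of an encoder in names per second.
 * $encode is any callable that takes a string. Hypothetical helper,
 * not part of any library.
 */
function names_per_second(callable $encode, array $names, int $rounds = 5): float
{
    $start = hrtime(true);                      // monotonic clock, nanoseconds
    for ($r = 0; $r < $rounds; $r++) {
        foreach ($names as $name) {
            $encode($name);
        }
    }
    $elapsed = (hrtime(true) - $start) / 1e9;   // seconds
    return ($rounds * count($names)) / max($elapsed, 1e-9);
}

// Placeholder corpus; swap in real names for a meaningful number.
$names = array_fill(0, 10000, 'Schwarzenegger');
printf("metaphone(): %.0f names/s\n", names_per_second('metaphone', $names));
```

Pass a userland implementation and a native one through the same function and the ratio between the two results is the number that matters.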

That is what moved me from "argue against the RFC" to "agree with the RFC, then go build the thing it should have recommended." The result is phonetic, a native extension that ships the five encoders core never had, plus the comparison helpers that answer the only question most people actually ask.

Double Metaphone: the successor you can actually ship

Double Metaphone is the algorithm the RFC's own rationale reaches for first, and it is the natural default. It returns two keys, a primary and an alternate, so a name with more than one plausible pronunciation matches on either.

double_metaphone("Schwarzenegger");   // ['primary' => 'XRSN', 'alternate' => 'XFRT']
double_metaphone("Catherine", 3);     // ['primary' => 'K0R',  'alternate' => 'KTR']
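
What "matches on either" means in practice can be sketched in a few lines. This is an illustration, not the extension's code: key_match_strength() is a hypothetical helper, it only assumes the ['primary' => ..., 'alternate' => ...] shape shown above, and the second and third key arrays in the usage lines are made up.

```php
<?php
/**
 * Grade two Double Metaphone results: 2 when the primary keys agree,
 * 1 when any other primary/alternate pairing agrees, 0 otherwise.
 * Hypothetical helper operating on already-encoded key arrays.
 */
function key_match_strength(array $a, array $b): int
{
    if ($a['primary'] !== '' && $a['primary'] === $b['primary']) {
        return 2;
    }
    // Drop empty keys so two missing alternates never count as a match.
    $notEmpty = fn (string $key): bool => $key !== '';
    $keysA = array_filter([$a['primary'], $a['alternate']], $notEmpty);
    $keysB = array_filter([$b['primary'], $b['alternate']], $notEmpty);

    return array_intersect($keysA, $keysB) ? 1 : 0;
}

$catherine = ['primary' => 'K0R', 'alternate' => 'KTR'];   // from the example above
var_dump(key_match_strength($catherine, ['primary' => 'K0R', 'alternate' => '']));   // made-up keys
var_dump(key_match_strength($catherine, ['primary' => 'KTR', 'alternate' => '']));   // made-up keys
```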

The RFC names Metaphone 3 as the other successor. Worth knowing before you reach for it: the Metaphone 3 reference implementation is a commercial product, sold as source for $240, not free software. An older 2009 build sits under a BSD license inside OpenRefine, minus years of accuracy corrections. Double Metaphone is the newest variant you can actually vendor into an open project, which is why it, and not Metaphone 3, is what belongs within reach of every PHP app. My implementation is clean-room from the published algorithm, with Apache Commons Codec used only as a parity oracle for the test vectors.

Beider-Morse: matching across languages, and the GPL trap I had to dodge

Moskowitz and Moskovitz are one surname through two transliterations. Иванов is a name most encoders will not touch. Beider-Morse Phonetic Matching handles both, because it is language-aware. It detects, or is told, the source language family, then applies the right transliteration rules for Slavic, Germanic, Hebrew, and Romance names.

bmpm("Garcia", BMPM_SEPHARDIC, BMPM_EXACT);   // "garsia|gartSa"
bmpm_match("Moskowitz", "Moskovitz");          // true
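
The encoded string is a set in disguise, and comparing two of them means intersecting the alternatives. A minimal sketch, assuming the flat pipe-delimited form shown in the example above (I am not claiming every BMPM output is that simple); both helpers are hypothetical and the second input string is invented.

```php
<?php
/** Split a pipe-delimited phonetic string into a set of tokens. Hypothetical helper. */
function phonetic_tokens(string $encoded): array
{
    $tokens = array_filter(explode('|', $encoded), fn (string $t): bool => $t !== '');
    return array_values(array_unique($tokens));
}

/** True when two encoded strings share at least one token (case-sensitive). */
function tokens_overlap(string $a, string $b): bool
{
    return (bool) array_intersect(phonetic_tokens($a), phonetic_tokens($b));
}

// First string is the Garcia output above; the second is made up for illustration.
var_dump(tokens_overlap('garsia|gartSa', 'gartSa|garzia'));
```

bmpm_match() exists so you do not have to carry this around yourself.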

The hard part is not the code, it is the data. BMPM is thousands of rules, and every obvious source for those rules is GPL. The canonical Beider-Morse PHP reference is GPL-3.0. abydos, the popular Python phonetic library, is GPL-3.0 specifically because it ported that same rule data. Copy either into a BSD project and the whole project turns GPL with it.

The escape is Apache Commons Codec, which ships the identical rule tables under Apache-2.0. I vendored the data from there, kept its license header, and added an Apache-2.0 section to the extension's LICENSE. Same data, clean license, and Commons Codec doubles as the parity oracle, so I know the output matches the de-facto reference. The lesson generalizes past this one extension: with phonetic algorithms, the data carries the license, not the code you wrap around it.

One honest caveat. BMPM is slow. It runs language detection and three rule passes over the input, so it costs roughly 60 times as much as a Double Metaphone call, around 91,000 names a second on my machine. You pick it for recall, not throughput. When you already know the language, passing it explicitly skips detection and buys some of that back.

Daitch-Mokotoff Soundex: the genealogy standard

If you are matching Eastern-European or Ashkenazi surnames, this is the field standard, and core never had it. Daitch-Mokotoff was built for exactly that problem, and it is what genealogy databases actually run. It emits a set of six-digit codes, branching on ambiguous letters so one name can carry several codes at once.

dm_soundex("Auerbach");                     // ['097400', '097500']
dm_soundex_match("Moskowitz", "Moskovitz"); // true

Its rule data comes from Apache Commons Codec too, for the same licensing reason as BMPM.
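
The branching is easier to picture as a cross product. The sketch below is a toy, not the real rule engine: expand_branches() is a hypothetical helper, and the fragments are hand-picked to reproduce the Auerbach example above rather than read from the rule table.

```php
<?php
/**
 * Expand per-segment alternatives into every full code, then truncate or
 * zero-pad each code to a fixed length. Hypothetical illustration only.
 */
function expand_branches(array $segments, int $length = 6): array
{
    $codes = [''];
    foreach ($segments as $alternatives) {
        $next = [];
        foreach ($codes as $prefix) {
            foreach ($alternatives as $fragment) {
                $next[] = $prefix . $fragment;   // one branch per alternative
            }
        }
        $codes = $next;
    }
    $codes = array_map(
        fn (string $c): string => str_pad(substr($c, 0, $length), $length, '0'),
        $codes
    );
    return array_values(array_unique($codes));
}

// Hand-picked fragments: the last segment is the ambiguous one with two readings.
print_r(expand_branches([['0'], ['9'], ['7'], ['4', '5']]));
```

Two ambiguous segments would yield up to four codes, which is why the result has to be treated as a set.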

NYSIIS and Match Rating: two lighter English encoders

Both are cheap single-key encoders for American and English names, useful as a second opinion or an alternate index key. NYSIIS (New York State Identification and Intelligence System) is tuned for American surnames and returns one key. Match Rating Approach (Western Airlines, 1977) produces a compact codex and, unusually, ships its own similarity test instead of relying on key equality.

nysiis("Larson");            // "LARSAN"
nysiis("Larsen");            // "LARSAN"  (same key)
match_rating("Catherine");   // "CTHRN"

Both are clean-room from their published specs. They are the fastest encoders in the set, cheap enough to run as a second key alongside Double Metaphone when you want a little more recall without paying BMPM's cost.
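
Running a second key is mostly bookkeeping. Here is a sketch of one way to do it, with assumptions stated up front: build_multi_key_index() is a hypothetical helper, and core soundex() and metaphone() stand in for the encoders so the snippet runs without the extension.

```php
<?php
/**
 * Index each record under one key per encoder, prefixed with the encoder
 * label so keys from different algorithms never collide. Hypothetical helper.
 */
function build_multi_key_index(array $records, array $encoders): array
{
    $index = [];
    foreach ($records as $id => $name) {
        foreach ($encoders as $label => $encode) {
            $key = $encode($name);
            if ($key !== '') {
                $index["$label:$key"][] = $id;
            }
        }
    }
    return $index;
}

// Stand-ins from core. With the extension loaded you would pass 'nysiis'
// and a small wrapper around double_metaphone() here instead.
$encoders = ['soundex' => 'soundex', 'metaphone' => 'metaphone'];
$index = build_multi_key_index([1 => 'Larson', 2 => 'Larsen'], $encoders);
print_r($index);
```

A lookup then checks each prefixed key for the query name and unions the ids it finds.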

The part I use most: the comparison helpers

The real day-to-day question is never "encode this name." It is "do these two names sound alike," and every algorithm answers it differently. That difference is where userland code quietly gets phonetic matching wrong.

Double Metaphone gives two keys: the match is strong when the primary keys agree and weaker when only an alternate crosses. Daitch-Mokotoff and BMPM give sets, so you match on intersection, not equality. Match Rating has a length-and-rating threshold that plain codex comparison skips entirely. Get that logic wrong and you either miss real matches or wave through garbage. So the extension ships one comparison helper per algorithm, each encoding the correct test for that encoder:

double_metaphone_match("Catherine", "Kathryn");   // 2  (primary keys agree)
double_metaphone_match("Vagner", "Wagner");        // 1  (only an alternate crosses)
dm_soundex_match("Moskowitz", "Moskovitz");        // true
bmpm_match("Peterson", "Petersen");                // true
nysiis_match("Smith", "Schmit");                   // true  (both SNAT)
match_rating_compare("Catherine", "Kathryn");      // true

double_metaphone_match() returns 2, 1, or 0, so you can rank match strength instead of treating it as a coin flip. The set-based helpers return a bool on intersection. match_rating_compare() applies the threshold the algorithm actually specifies. You call one function and get the right answer for that encoder, rather than reimplementing set intersection in a loop and getting the edge cases subtly wrong.
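
To show why plain codex equality is not enough for Match Rating, here is a sketch of the comparison step as it is commonly described for the algorithm. Treat it as an illustration under assumptions: it works on already-computed codexes, both function names are hypothetical, the KTHRYN codex is taken from the widely cited worked example, and the extension's own match_rating_compare() may differ in detail.

```php
<?php
/** Remove characters that are identical at the same position in both strings. */
function strip_aligned(string $a, string $b): array
{
    $n = min(strlen($a), strlen($b));
    $restA = '';
    $restB = '';
    for ($i = 0; $i < $n; $i++) {
        if ($a[$i] !== $b[$i]) {
            $restA .= $a[$i];
            $restB .= $b[$i];
        }
    }
    // Keep the unaligned tail of whichever string is longer.
    return [$restA . substr($a, $n), $restB . substr($b, $n)];
}

/** Match Rating comparison on two codexes. Hypothetical sketch. */
function mra_codex_match(string $a, string $b): bool
{
    if (abs(strlen($a) - strlen($b)) >= 3) {
        return false;                             // lengths too far apart
    }
    $sum = strlen($a) + strlen($b);
    $minimum = match (true) {                     // threshold depends on combined length
        $sum <= 4  => 5,
        $sum <= 7  => 4,
        $sum <= 11 => 3,
        default    => 2,
    };
    [$a, $b] = strip_aligned($a, $b);                    // left-to-right pass
    [$a, $b] = strip_aligned(strrev($a), strrev($b));    // right-to-left pass
    $rating = 6 - max(strlen($a), strlen($b));

    return $rating >= $minimum;
}

var_dump(mra_codex_match('CTHRN', 'KTHRYN'));
```

The two codexes are not equal, yet the rating clears the threshold, and that is exactly the case a naive === comparison misses. With the extension loaded, match_rating_compare() is the single call that covers it.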

For a one-off check, that is the whole API. For repeated lookups against a fixed corpus, encode once and index the keys, then query by encoded value instead of comparing pair by pair:

$index = [];
foreach ($records as $id => $name) {
    foreach (dm_soundex($name) as $code) {   // index every code in the set
        $index[$code][] = $id;
    }
}
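$hits = [];
foreach (dm_soundex("Moskovitz") as $code) {   // query every code, not just the first
    foreach ($index[$code] ?? [] as $id) {
        $hits[$id] = true;                     // dedupe ids that match on several codes
    }
}
$hits = array_keys($hits);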

Which one do I reach for?

Double Metaphone as the fast general-purpose default. BMPM when names cross languages or scripts and you can afford the cost. Daitch-Mokotoff for Eastern-European and Jewish genealogy, where it is the standard. NYSIIS or Match Rating as a cheap second key. The relative costs matter when you are encoding at scale:

encoder	speed (relative cost per call)	strongest for
`match_rating()`	fastest (0.24x)	English names; ships its own similarity test
`nysiis()`	fast (0.42x)	American and English surnames
`double_metaphone()`	fast (1.0x baseline)	general Latin-script names
`dm_soundex()`	middle (~2.3x)	Eastern-European and Ashkenazi surnames
`bmpm()`	slowest (~60x)	cross-language and transliteration variants
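
Those ratios turn into wall-clock estimates with a little arithmetic. A rough sketch, using only the figures quoted in this post (BMPM at around 91,000 names a second on my machine, plus the relative costs above); the corpus size is hypothetical and everything derived is extrapolation, not a measurement.

```php
<?php
// Figures quoted in the post. The ~60x and ~2.3x values are approximate,
// so treat every derived number as an order-of-magnitude estimate.
$bmpmRate = 91000;   // names per second
$relativeCost = [
    'match_rating'     => 0.24,
    'nysiis'           => 0.42,
    'double_metaphone' => 1.0,
    'dm_soundex'       => 2.3,
    'bmpm'             => 60.0,
];

$corpus = 1000000;   // hypothetical corpus size
foreach ($relativeCost as $encoder => $cost) {
    // Scale the BMPM rate by how much cheaper this encoder is per call.
    $rate = $bmpmRate * $relativeCost['bmpm'] / $cost;
    printf("%-18s ~%.2f s per %d names\n", $encoder, $corpus / $rate, $corpus);
}
```

Encoding cost only matters at index-build time, which is why the slow encoders are still viable when recall is what you need.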

Two limits worth stating plainly. These are heuristic, culture-bound encoders. They target Latin-script names, and for BMPM and Daitch-Mokotoff, specific language families. They are not a universal global-name solution. And Greek-script input has a known limitation: capitals are not lowercased, because the context-sensitive final-sigma rule cannot be expressed as a point-wise case map, so Greek names need to be passed lowercased or romanized.
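
For the Greek case, the lowercasing can happen in userland before the name reaches the encoder. A minimal sketch, assuming ext-mbstring is available; greek_lower() is a hypothetical helper, and I am assuming here that normalising a word-final sigma to its final form is what you want, which you should check against your own data.

```php
<?php
/**
 * Lowercase a Greek name and normalise a word-final sigma to its final form.
 * Hypothetical pre-processing helper; requires ext-mbstring.
 */
function greek_lower(string $name): string
{
    $lower = mb_strtolower($name, 'UTF-8');
    // Depending on the PHP version, mb_strtolower() may map every capital
    // sigma to the medial form. Fix any sigma that follows a letter and ends
    // a word; this is a no-op when the final form is already present.
    return preg_replace('/(?<=\p{L})σ(?!\p{L})/u', 'ς', $lower) ?? $lower;
}

echo greek_lower('ΟΔΥΣΣΕΥΣ'), "\n";
```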

Where I landed

I set out to defend metaphone() and finished by agreeing it should be deprecated. The function is the weakest version of an idea that has produced far better tools in the more than thirty years since. The flaw in the deprecation story is not the deprecation. It is pointing at a slower userland library as the replacement for a native one. Phonetic encoding runs in a hot loop, so it belongs in C.

The five encoders core never shipped are now one native extension, with the "do these sound alike" helpers most callers actually need:

pie install iliaal/phonetic

github.com/iliaal/phonetic
