I Forked a Dead PHP Name Parser Because It Couldn't Tell a Credential From a Surname


Here is a bug in a name-parsing library I use at work:

// theiconic/name-parser
$parser = new TheIconic\NameParser\Parser();
$name = $parser->parse('Jane Doe DDS');
$name->getLastname();    // "Dds"
$name->getMiddlename();  // "Doe"

The dental credential is now her last name. The real surname got shoved into the middle-name field. Every row with a trailing credential and no comma had some version of this, and in a list of clinicians that is most of them.

The library is theiconic/name-parser, a small, genuinely useful PHP package that splits a full-name string into salutation, first name, initials, last name, suffix, and nickname. It does the boring parts well, but the upstream repo went quiet around 2020, and bugs like the one above never got fixed.

So I forked it. Today I'm releasing iliaal/nameparser, a maintained fork that fixes the credential handling, adds a confidence signal for the cases it genuinely can't decide, and targets PHP 8.3+. Most of the fix is unglamorous boundary work. One part of it rests on a single idea worth writing down: the parser was throwing away the one signal that tells a credential from a name.

When the Credential Is Also a Name

There are two ways a credential ends up in the wrong field, and they are not the same bug.

The easy one is Jane Doe DDS. DDS is never a name, so the only question is where the token boundary sits. Upstream, with no comma to anchor it, takes the last token as the surname and never asks whether it is actually a trailing credential. The fork checks, recognizes DDS as a suffix, and keeps Doe:

// fork
$name = $parser->parse('Jane Doe DDS');
$name->getLastname();  // "Doe"
$name->getSuffix();    // "DDS"
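
A minimal sketch of what that trailing check could look like, assuming a flat list of credentials that are never names. The function name peelTrailingCredentials and the three-entry list are mine for illustration; this is not the fork's internal code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not the fork's code: peel known credentials off the
 * end of a comma-less name before choosing the surname.
 *
 * @param list<string> $credentials lowercase keys of credentials that are never names
 * @return array{name: list<string>, suffix: list<string>}
 */
function peelTrailingCredentials(string $input, array $credentials): array
{
    $tokens = preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    $suffix = [];

    // Walk backwards while the last token is a known credential, but always
    // leave at least one token so the name is never consumed entirely.
    while (count($tokens) > 1) {
        $last = $tokens[count($tokens) - 1];
        $key = strtolower(rtrim($last, '.,'));
        if (!in_array($key, $credentials, true)) {
            break;
        }
        array_unshift($suffix, array_pop($tokens));
    }

    return ['name' => $tokens, 'suffix' => $suffix];
}

$parts = peelTrailingCredentials('Jane Doe DDS', ['dds', 'rn', 'pharmd']);
echo end($parts['name']), ' / ', implode(' ', $parts['suffix']), PHP_EOL;
// prints: Doe / DDS
```

Lowercasing the lookup key is harmless here only because every entry in the list is unambiguous. Tokens that double as names need more care.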

The hard one is when the token is both a name and a credential. Ma is a surname. MA is a master's degree. Do is a Vietnamese surname. DO is a doctor of osteopathic medicine. Feed upstream a comma form where the given name is one of these and the name disappears:

// theiconic/name-parser
$name = $parser->parse('Smith, Ma');
$name->getFirstname();  // ""
$name->getSuffix();     // "MA"

The given name Ma was stripped into the suffix as the credential MA. The field is now empty. Nothing threw, nothing warned; a person's first name was deleted because it collided with a degree abbreviation.

Casing Is the Signal Your Parser Throws Away

Upstream keys every token through strtolower() before matching it against its credential dictionary. Once Ma becomes ma, it matches MA and gets stripped. The piece of information that would have saved it, the capital pattern, was deleted on the first line.

People write credentials in capitals and names in title case. Smith, Ma is a person named Ma. Smith, MA is someone with a master's degree and no recorded first name. The capitalization is not decoration; it is the writer's own distinction between the two, and lowercasing throws it away before anything looks at it.

The fork stops throwing it away. An ambiguous token, one that is both a plausible name and a known credential, is treated as a credential only when it is written in all caps. Title case or lower case keeps it as a name:

// fork
$parser->parse('Smith, Ma')->getFirstname();  // "Ma"   title case, kept as a name
$parser->parse('Smith, MA')->getFirstname();  // ""     all caps, read as the credential MA
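
The rule is small enough to sketch in a few lines. What follows is my own illustration under stated assumptions, not the fork's source: readsAsCredential is a hypothetical name, and the two lists are tiny stand-ins for a real dictionary.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the casing rule, not the fork's actual code.
 * $credentials and $ambiguous hold lowercase keys; $ambiguous lists the
 * credentials that are also plausible names.
 *
 * @param list<string> $credentials
 * @param list<string> $ambiguous
 */
function readsAsCredential(string $token, array $credentials, array $ambiguous): bool
{
    $bare = rtrim($token, '.,');
    $key = strtolower($bare);

    if (!in_array($key, $credentials, true)) {
        return false; // not a credential at all
    }
    if (!in_array($key, $ambiguous, true)) {
        return true; // unambiguous credential, casing is irrelevant
    }

    // Ambiguous token: only the all-caps spelling counts as a credential.
    return $bare === strtoupper($bare);
}

var_dump(readsAsCredential('Ma', ['ma', 'dds'], ['ma'])); // bool(false)
var_dump(readsAsCredential('MA', ['ma', 'dds'], ['ma'])); // bool(true)
```

The lookup key is still lowercased. The difference is that the original token survives long enough for its casing to be consulted.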

The same idea fixes a second class of mangling. Under all-caps input there is no case left to mark a two-letter token as a set of initials, so upstream guesses wrong and splits a short given name down to one letter:

$parser->parse('JO ANDERSON')->getFirstname();
// upstream: "J"   (JO read as the initial J)
// fork:     "Jo"

Same dictionary, same tokens. The difference is that casing now decides the ambiguous cases instead of being discarded before the decision is made.
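
One way to picture the initials case, as a sketch only: a two-letter all-caps token counts as initials when the rest of the input shows the writer uses mixed case, and not otherwise. The helper readsAsInitials and its exact rule are my assumptions; the fork's real logic may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: treat a two-letter all-caps token as initials only
 * when the full input contains lowercase letters, i.e. the capitals are a
 * deliberate choice rather than the default for the whole string.
 */
function readsAsInitials(string $token, string $fullInput): bool
{
    $twoCaps = preg_match('/^\p{Lu}{2}$/u', $token) === 1;
    $inputHasLowercase = preg_match('/\p{Ll}/u', $fullInput) === 1;

    return $twoCaps && $inputHasLowercase;
}

var_dump(readsAsInitials('JO', 'JO ANDERSON')); // bool(false): uniform caps, keep the name whole
var_dump(readsAsInitials('JO', 'JO Anderson')); // bool(true): the caps stand out in this sketch
```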

This is a small change with a large blast radius, because the failure was silent. Upstream didn't throw. It wrote a wrong field and moved on, which is exactly the kind of bug you find six weeks later in a report nobody can reconcile.

When Casing Can't Decide: A Confidence Signal

Casing only works when there is casing to read. Uniform-case input, an all-caps legacy export or an all-lowercase dump, carries no signal at all. NGUYEN, VI could be the surname Nguyen with the given name Vi, or the surname Nguyen with the credential VI, and nothing in the string tells you which. The parser has to pick a default, and a default is a guess.

For a one-off parse, a guess is fine. For a batch import of a few hundred thousand person records, a silent guess is a data-integrity problem you won't notice until it matters. So the fork adds an advisory pass that tells you when the input was undecidable:

use Iliaal\NameParser\Confidence;

$result = Confidence::assess('NGUYEN, VI');
// [
//   'ambiguous' => true,
//   'notes' => ["'VI' could be a name or a credential; input casing is uniform"],
// ]

if ($result['ambiguous']) {
    // route this row to manual review instead of trusting the split
}

The same signal is available on the parsed result, derived from the same input the parser saw:

$parser->parse('NGUYEN, VI')->getConfidence();  // ['ambiguous' => true, 'notes' => [...]]

getConfidence() is read-only. It doesn't change what parse() returns; it's a second opinion you opt into. A mixed-case Nguyen, Vi stays unflagged, because the title-case Vi already resolved to a given name. The flag fires only when the casing genuinely could not decide.

Here is the limitation, stated plainly, because the post that hides it is the post you stop trusting: this is a heuristic keyed on casing, and uniform-case data has no casing to key on. On an all-caps dataset, an ambiguous trailing token still reads as a credential by default. What the confidence pass buys you is a queue. It flags the uniform-case rows where the token plausibly collides with a real name, so you can review those instead of trusting all of them. It does not flag clean credentials that are not also names (RN, PT, OD), because flagging every one of those would drown the review queue on exactly the all-caps data where review matters most. A result of 'ambiguous' => false on all-caps input is not a correctness guarantee. It means no name collision was detected, not that the split is definitely right.
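
For a batch import, the queue idea might be wired up roughly like this. The sketch takes the assessor as a callable so it runs on its own; partitionByConfidence and the stub assessor with its three-entry collision list are hypothetical and exist only for the demonstration.

```php
<?php
declare(strict_types=1);

/**
 * Split rows into an auto-accept pile and a manual-review queue.
 *
 * $assess is any callable returning ['ambiguous' => bool, 'notes' => string[]],
 * mirroring the result shape shown earlier in this post.
 *
 * @param iterable<string> $rows
 * @return array{accepted: list<string>, review: list<array{row: string, notes: list<string>}>}
 */
function partitionByConfidence(iterable $rows, callable $assess): array
{
    $out = ['accepted' => [], 'review' => []];
    foreach ($rows as $row) {
        $result = $assess($row);
        if ($result['ambiguous']) {
            $out['review'][] = ['row' => $row, 'notes' => $result['notes']];
        } else {
            $out['accepted'][] = $row;
        }
    }

    return $out;
}

// Stand-in assessor for demonstration only: flags uniform-case input whose
// last token is in a tiny hypothetical name/credential collision list.
$stubAssess = static function (string $row): array {
    $uniform = $row === strtoupper($row) || $row === strtolower($row);
    $tokens = preg_split('/[\s,]+/', $row, -1, PREG_SPLIT_NO_EMPTY);
    $last = (string) end($tokens);
    $hit = $uniform && in_array(strtolower($last), ['vi', 'ma', 'do'], true);

    return [
        'ambiguous' => $hit,
        'notes' => $hit ? ["'{$last}' could be a name or a credential"] : [],
    ];
};

$queues = partitionByConfidence(['NGUYEN, VI', 'Nguyen, Vi', 'SMITH, RN'], $stubAssess);
echo count($queues['review']), ' row(s) need review', PHP_EOL;
// prints: 1 row(s) need review
```

With the fork installed, passing Confidence::assess(...) as the callable should slot in, assuming it returns the shape shown above. SMITH, RN lands in the accepted pile here for the same reason the paragraph above gives: RN is not also a name.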

What Else Changed

The casing work is the headline, but a maintained fork is also a place to fix the smaller things that accumulate in a dormant library.

The toArray() method returns every part under a fixed key set, with an empty string for any part that is absent. Upstream's getAll() omits empty parts and varies its keys, so consuming it means existence-checking every field. toArray() is a stable shape you can hand to json_encode() or a DTO without guards:

$parser->parse('Dr. Jane A. Doe DDS')->toArray();
// [
//   'salutation' => 'Dr.', 'firstname' => 'Jane', 'initials' => 'A.',
//   'middlename' => '', 'lastname_prefix' => '', 'lastname' => 'Doe',
//   'suffix' => 'DDS', 'nickname' => '', 'given_name' => 'Jane A.',
//   'full_name' => 'Jane A. Doe',
// ]
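
If you are stuck consuming a sparse array for a while, the same stable shape can be approximated on your side. This is a sketch of the idea rather than the fork's implementation: toStableShape is a hypothetical helper, and the key list is copied from the output above.

```php
<?php
declare(strict_types=1);

const NAME_KEYS = [
    'salutation', 'firstname', 'initials', 'middlename', 'lastname_prefix',
    'lastname', 'suffix', 'nickname', 'given_name', 'full_name',
];

/**
 * Pad a sparse parts array out to the fixed key set, in a fixed order.
 *
 * @param array<string, string> $sparse
 * @return array<string, string>
 */
function toStableShape(array $sparse): array
{
    $defaults = array_fill_keys(NAME_KEYS, '');

    // Drop unknown keys, then let present values override the blanks.
    // array_replace() keeps the key order of its first argument.
    return array_replace($defaults, array_intersect_key($sparse, $defaults));
}

$row = toStableShape(['lastname' => 'Doe', 'firstname' => 'Jane']);
echo json_encode($row), PHP_EOL;
```

Every consumer downstream can then index any of the ten keys without an isset() guard.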

The credential dictionary grew teeth for healthcare data specifically. Beyond the standard academic and professional suffixes, the fork adds nursing and allied-health credentials, RN, NP, PharmD, APRN, PA-C, OTR/L, and thirty-odd more, mined by frequency from the NPI registry. If you parse clinician names, a trailing credential no longer leaks into the first name.

A handful of robustness fixes round it out. An unclosed nickname delimiter no longer swallows the surname: upstream parses John (Bob Smith to a last name of John, while the fork keeps Smith. A lone bracket or quote token returns an empty Name instead of crashing parse() with a TypeError. Config setters take effect on a reused parser even when called after the first parse(). Everything after a second comma is kept as a middle name, so Smith, John, Robert keeps Robert while Smith, MD, PhD still strips to suffixes.
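
The second-comma rule can be sketched as a pass over the comma-separated segments. This is an illustration under my own assumptions, not the fork's code: sortCommaSegments is a hypothetical name, and the credential list holds only unambiguous entries.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: sort the comma-separated segments after the surname
 * into given name, middle names, and suffixes.
 *
 * @param list<string> $credentials lowercase keys of unambiguous credentials
 * @return array{lastname: string, firstname: string, middlename: list<string>, suffix: list<string>}
 */
function sortCommaSegments(string $input, array $credentials): array
{
    $segments = array_values(array_filter(
        array_map('trim', explode(',', $input)),
        static fn (string $s): bool => $s !== ''
    ));

    $out = [
        'lastname' => (string) array_shift($segments),
        'firstname' => '',
        'middlename' => [],
        'suffix' => [],
    ];

    foreach ($segments as $segment) {
        if (in_array(strtolower(rtrim($segment, '.')), $credentials, true)) {
            $out['suffix'][] = $segment;      // credential, wherever it sits
        } elseif ($out['firstname'] === '') {
            $out['firstname'] = $segment;     // first non-credential segment
        } else {
            $out['middlename'][] = $segment;  // anything after that is kept
        }
    }

    return $out;
}

print_r(sortCommaSegments('Smith, John, Robert', ['md', 'phd']));
print_r(sortCommaSegments('Smith, MD, PhD', ['md', 'phd']));
```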

The fork targets PHP 8.3+, is tested through 8.5, and runs clean under PHPStan level 9. The full upstream getter surface is unchanged, so the fork is additive. If you were on the original, the methods you called still return what they returned, minus the credential bug.

Credit Where It's Due

This is a fork, not original work, and the lineage matters. The parser core is The Iconic's. The modernization to PHP 8.3+ that I built on came from Zachary Miller's fork. What I added is the casing-and-credential layer, the confidence signal, and the robustness fixes above. I'm standing on two other people's work and I'd rather say so than pretend the whole thing sprang from nothing.

Install it:

composer require iliaal/nameparser

The repo, with the full changelog and the all-caps limitation documented in the README, is at github.com/iliaal/nameparser. If you parse names from professional or registry data and you have ever found a credential sitting in a surname column, this is the fix.
