I generate a lot of UUIDs. Primary keys, cache keys, event IDs, request-trace IDs. Probably too many, if I'm honest about it. On a busy request path the same function gets called dozens of times before the response is even assembled, and across a fleet that adds up to a number of UUIDs per second I would rather not write down.

For years that cost was invisible to me, the way a single random_bytes() call is invisible until you make a few billion of them. Then it showed up in a profile, sitting higher than it had any right to, and I started paying attention to where the time actually went. It went to two places: pulling fresh entropy from the kernel once per UUID, and formatting 16 raw bytes into the 36-character canonical string. Both are cheap. Neither is free. Multiply by "too many" and you get a real slice of CPU spent doing nothing but minting identifiers.

So I wrote fast_uuid, a PHP extension that does UUID generation in pure C. This post is about why the two existing options each left a gap, what the extension does differently, and what it costs you.

The two options I already had, and where each one stopped

PHP developers generating UUIDs today reach for one of two things, and both are good at what they do.

The PECL uuid extension wraps libuuid from util-linux. It is fast at version 1 (time-based) UUIDs, because libuuid has a tuned generator for them. But it has three problems for my use case. Its version coverage is partial: v1 and v4 are reliable, v3 and v5 live behind separate uuid_generate_md5() / uuid_generate_sha1() functions, v6 and v7 only compile in if you built against a recent enough libuuid, and v8 is absent entirely. Its v4 path is slow, because it asks the kernel for fresh entropy on every call. And its API looks nothing like the ramsey/uuid code most projects are already written against, so adopting it means rewriting every call site.

There is also a licensing wrinkle worth stating precisely, because it is easy to get wrong. libuuid itself is BSD-3-Clause and permissive. The PECL uuid extension that binds it, however, is LGPL-2.1-or-later. If you only ever apt install the package that's no issue, but the moment you want to vendor the binding, ship it inside a permissively-licensed product, or relicense around it, the copyleft on the PHP binding is friction the BSD library underneath does not have.

The other option is ramsey/uuid, and it is excellent. MIT-licensed, every RFC version, ULID support, an API that has become the de facto standard in the Laravel and Symfony worlds. I have used it for years and have no complaints about its correctness or its ergonomics. The one thing it can't escape is that it's PHP. Its v4 generator calls random_bytes() once per UUID, and that syscall dominates a job that is otherwise a few dozen nanoseconds of work. Its v1 path is its slowest, because the clock-sequence and node bookkeeping all happen in PHP. And for UUIDv7, which I now reach for on time-ordered primary keys, getDateTime() hands you a DateTimeImmutable at millisecond resolution with no cheap integer-millisecond path to skip the object construction entirely.

So the gap was specific. One option is fast on a narrow set of versions but licensed awkwardly and API-incompatible. The other is broad and beautifully designed but pays a PHP-level cost on the hot path. Neither offered speed, full RFC 9562 coverage, and a familiar API at the same time. I wanted all three.

What "all three" actually meant

The design target was three things at once, none of them negotiable.

Full RFC 9562 coverage: versions 1, 2 (DCE Security), 3, 4, 5, 6, 7, 8, plus nil and max. Not "the popular ones." All of them, so the extension is never the reason you can't use a version.

An API a ramsey/uuid user already knows. The object API mirrors ramsey/uuid under the FastUuid namespace, so the cold-path ergonomics are familiar. For the hot path there is a procedural set of functions (uuid_v4(), uuid_v7(), and friends) that return a zend_string directly, so there is no object to allocate or free.

Pure C, no C++. No libstdc++ to link against, no external UUID library to track for version skew or licensing. The whole thing is BSD-3-Clause, including the entropy and formatting paths, so there is no copyleft binding sitting between you and a permissive license.

The rest of the post is the two optimizations that close the speed gap, the UUIDv7 work, the numbers, and the honest costs.

Why ramsey/uuid spends its time in random_bytes, and what to do instead

The short version: random_bytes() is a syscall, and a syscall per UUID is the bottleneck. fast_uuid makes one kernel entropy request and serves hundreds of UUIDs from it before going back.

Generating a v4 UUID is 16 random bytes with six bits overwritten for version and variant. The randomness is the entire job. In ramsey/uuid that randomness comes from random_bytes(16), which on Linux funnels into the getrandom() syscall. Crossing into the kernel and back costs far more than the handful of nanoseconds it takes to set the version bits, so at scale you're not paying for UUID generation, you're paying for syscalls.
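To make the "six bits" concrete, here is a minimal C sketch of the stamping step. It is an illustration rather than fast_uuid's source, and the function name uuid_v4_stamp is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stamp version 4 and the RFC 9562 variant onto 16 random bytes.
 * Exactly six bits change: four version bits and two variant bits.
 * (uuid_v4_stamp is a hypothetical name for this sketch.) */
static void uuid_v4_stamp(uint8_t bytes[16])
{
    assert(bytes != NULL);
    bytes[6] = (uint8_t)((bytes[6] & 0x0F) | 0x40); /* high nibble = 0100 (version 4) */
    bytes[8] = (uint8_t)((bytes[8] & 0x3F) | 0x80); /* top two bits = 10 (variant)    */
}
```

Everything else in the 16 bytes is left exactly as the random source produced it, which is why the cost of the random source dominates.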

fast_uuid keeps an 8 KB per-thread buffer and fills it with a single getrandom() call. At 16 bytes per v4 UUID, that one syscall covers roughly 500 UUIDs (8,192 / 16 = 512) before the buffer needs refilling. The entropy is just as strong, because it comes from the same kernel CSPRNG; you simply amortize the crossing cost across the batch instead of paying it on every call. That single change accounts for most of the order-of-magnitude gap on v4.
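The batching idea can be sketched in a few lines of C. This is a simplified illustration, not the extension's code: the names entropy_pool, pool_refill, and pool_take are hypothetical, the refills counter exists only to make the amortization visible, and the sketch assumes Linux with glibc's getrandom() wrapper. A production version would also have to discard the buffer after fork() so a child never replays its parent's bytes, which is left out here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/random.h>   /* getrandom(); Linux, glibc 2.25 or newer */

#define POOL_SIZE 8192    /* 8 KB, the buffer size quoted above */

typedef struct {
    uint8_t  buf[POOL_SIZE];
    size_t   pos;         /* next unread byte; POOL_SIZE means "empty" */
    unsigned refills;     /* illustrative counter only */
} entropy_pool;

/* One pool per thread, starting out empty. */
static _Thread_local entropy_pool tl_pool = { .pos = POOL_SIZE };

/* Refill the whole buffer, retrying on short reads and EINTR. */
static int pool_refill(entropy_pool *p)
{
    size_t got = 0;
    while (got < POOL_SIZE) {
        ssize_t n = getrandom(p->buf + got, POOL_SIZE - got, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        got += (size_t)n;
    }
    p->pos = 0;
    p->refills++;
    return 0;
}

/* Copy n unused bytes out of the pool. Returns 0 on success, -1 on failure. */
static int pool_take(uint8_t *out, size_t n)
{
    entropy_pool *p = &tl_pool;
    assert(n <= POOL_SIZE);
    if (POOL_SIZE - p->pos < n && pool_refill(p) != 0)
        return -1;
    memcpy(out, p->buf + p->pos, n);
    memset(p->buf + p->pos, 0, n);  /* wipe bytes that have been handed out */
    p->pos += n;
    return 0;
}
```

Calling pool_take(uuid, 16) in a loop triggers one refill for the first 512 calls and a second refill on call 513, which is the whole trick: the syscall count drops by a factor of 512 while every byte still originates in the kernel.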

For callers who want raw speed on identifiers that are not security-sensitive, there is also uuid_v4_fast(), which draws from a xoshiro256** PRNG instead of the kernel. That's a deliberately non-cryptographic generator. It's for ORM keys and trace IDs where unpredictability isn't a security property, and the "What it costs you" section below spells out why you must never reach for it when it is.
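For readers who have not met xoshiro256**, the generator is small enough to show whole. The sketch below follows the public reference algorithm by Blackman and Vigna, seeded through splitmix64 as its authors suggest. How fast_uuid seeds and stores its state is not covered in this post, so the struct, the seeding helper, and uuid_v4_fast_sketch are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t s[4]; } xoshiro256ss;  /* hypothetical wrapper type */

static inline uint64_t rotl64(uint64_t x, int k)
{
    return (x << k) | (x >> (64 - k));
}

/* splitmix64 expands one 64-bit seed into well-mixed state words. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

static void xoshiro_seed(xoshiro256ss *g, uint64_t seed)
{
    for (int i = 0; i < 4; i++)
        g->s[i] = splitmix64(&seed);
    /* The all-zero state is the one state the generator can never leave. */
    assert((g->s[0] | g->s[1] | g->s[2] | g->s[3]) != 0);
}

/* One step of xoshiro256**: a multiply-rotate-multiply scrambler
 * over a linear state update. */
static uint64_t xoshiro_next(xoshiro256ss *g)
{
    uint64_t *s = g->s;
    const uint64_t result = rotl64(s[1] * 5, 7) * 9;
    const uint64_t t = s[1] << 17;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl64(s[3], 45);
    return result;
}

/* Two 64-bit outputs make 16 bytes; then stamp version and variant. */
static void uuid_v4_fast_sketch(xoshiro256ss *g, uint8_t out[16])
{
    uint64_t a = xoshiro_next(g);
    uint64_t b = xoshiro_next(g);
    memcpy(out, &a, 8);
    memcpy(out + 8, &b, 8);
    out[6] = (uint8_t)((out[6] & 0x0F) | 0x40);
    out[8] = (uint8_t)((out[8] & 0x3F) | 0x80);
}
```

Two generators seeded with the same value emit identical UUID streams. That determinism is what makes the generator fast, and it is also why its output is predictable to anyone who can recover the state.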

Turning 16 bytes into 32 hex characters without a loop

The other half of the cost is formatting. In short: most implementations convert the binary UUID to its canonical hex string with a per-byte lookup loop, while a single SIMD table-lookup instruction can handle 16 bytes at once.

The canonical string is the 16 bytes expanded to 32 hex characters with four dashes inserted. Done scalar, that is a loop with two nibble lookups per byte. fast_uuid does it with one vector instruction over the whole 16 bytes: pshufb on x86-64 (SSSE3), vqtbl1q_u8 on ARM64 (NEON), both of which are byte-shuffle table lookups that turn a vector of nibbles into a vector of ASCII hex digits in a single shot. The extension picks which to use at runtime from CPU feature detection, and falls back to a scalar lookup table on anything without those instruction sets. There are no build flags to set and no -march to remember; the extension chooses the right path when it loads.
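As a rough sketch of that shape in C: a scalar lookup-table formatter, an SSSE3 variant built on _mm_shuffle_epi8 (the intrinsic for pshufb), and a runtime dispatcher. All function names are hypothetical and this is not the extension's code. In particular, this version uses two shuffles, one for the high nibbles and one for the low, then interleaves the results; the extension may arrange the work differently. The SSSE3 path is compiled only on x86-64 with GCC or Clang, and the NEON path is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable fallback: two table lookups per byte. */
static void hex32_scalar(const uint8_t in[16], char out[32])
{
    static const char digits[] = "0123456789abcdef";
    for (int i = 0; i < 16; i++) {
        out[2 * i]     = digits[in[i] >> 4];
        out[2 * i + 1] = digits[in[i] & 0x0F];
    }
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <tmmintrin.h>   /* SSSE3 intrinsics */
#define HAVE_SSSE3_PATH 1

__attribute__((target("ssse3")))
static void hex32_ssse3(const uint8_t in[16], char out[32])
{
    const __m128i table = _mm_setr_epi8('0','1','2','3','4','5','6','7',
                                        '8','9','a','b','c','d','e','f');
    const __m128i mask  = _mm_set1_epi8(0x0F);
    __m128i v    = _mm_loadu_si128((const __m128i *)in);
    __m128i hi   = _mm_and_si128(_mm_srli_epi16(v, 4), mask); /* high nibbles */
    __m128i lo   = _mm_and_si128(v, mask);                    /* low nibbles  */
    __m128i hi_c = _mm_shuffle_epi8(table, hi);  /* nibble -> ASCII, 16 lanes */
    __m128i lo_c = _mm_shuffle_epi8(table, lo);
    /* Interleave so each byte's high digit precedes its low digit. */
    _mm_storeu_si128((__m128i *)out,        _mm_unpacklo_epi8(hi_c, lo_c));
    _mm_storeu_si128((__m128i *)(out + 16), _mm_unpackhi_epi8(hi_c, lo_c));
}
#endif

/* Pick the vector path at runtime when the CPU reports SSSE3. */
static void hex32(const uint8_t in[16], char out[32])
{
    assert(in != NULL && out != NULL);
#ifdef HAVE_SSSE3_PATH
    if (__builtin_cpu_supports("ssse3")) {
        hex32_ssse3(in, out);
        return;
    }
#endif
    hex32_scalar(in, out);
}

/* Insert the four dashes: 8-4-4-4-12, plus a terminating NUL. */
static void uuid_format(const uint8_t in[16], char out[37])
{
    char hex[32];
    hex32(in, hex);
    memcpy(out,      hex,      8);  out[8]  = '-';
    memcpy(out + 9,  hex + 8,  4);  out[13] = '-';
    memcpy(out + 14, hex + 12, 4);  out[18] = '-';
    memcpy(out + 19, hex + 16, 4);  out[23] = '-';
    memcpy(out + 24, hex + 20, 12); out[36] = '\0';
}
```

Because the dispatch happens inside hex32, callers of uuid_format never see which path ran, which matches the "no build flags" property described above.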

The object itself is built for the same frugality. It is 16 inline bytes plus a lazily-cached canonical string, with no HashTable and no declared properties, so there is no per-object property storage to allocate and tear down.

UUIDv7 with sub-millisecond ordering, and a thank-you to Ben Ramsey

UUIDv7 is the version I care most about, because it gives you time-ordered, index-friendly primary keys without a separate sortable column. The claim up front: fast_uuid keeps v7 UUIDs in correct time order even when many are generated inside the same millisecond, and gives you an integer-millisecond path that skips DateTime entirely.

The v7 layout is a 48-bit Unix millisecond timestamp, then the 4 version bits, then rand_a (12 bits), then the 2 variant bits, then rand_b (62 bits). The problem with a plain implementation is that two UUIDs minted in the same millisecond share a timestamp and differ only in their random tails, so their sort order within that millisecond is arbitrary. RFC 9562 anticipates this in section 6.2. Its Method 3, "replace leftmost random bits with increased clock precision," puts a sub-millisecond clock fraction into the leftmost bits of rand_a. fast_uuid does exactly that, and adds a monotonic counter in rand_b, so same-millisecond v7s still sort in generation order. Your database index stays happy.
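A sketch of how such a layout can be assembled in C follows. The 12-bit fraction in rand_a is the RFC's Method 3. The counter placement is an assumption: the post does not say how wide the counter is or where in rand_b it sits, so this example arbitrarily uses the top 14 bits. The function name and signature are hypothetical too.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a v7 UUID from a nanosecond clock reading.
 * - bytes 0..5  : 48-bit Unix milliseconds, big-endian
 * - bytes 6..7  : version 7 + 12-bit sub-millisecond fraction (rand_a)
 * - bytes 8..9  : variant 10 + 14-bit counter (assumed placement in rand_b)
 * - bytes 10..15: random tail supplied by the caller */
static void uuid_v7_build(uint8_t out[16], uint64_t unix_ns,
                          uint16_t counter, const uint8_t rnd[6])
{
    assert(counter < (1u << 14));
    uint64_t ms   = unix_ns / 1000000u;
    /* Scale the leftover nanoseconds (0..999999) onto 0..4095. */
    uint32_t frac = (uint32_t)((unix_ns % 1000000u) * 4096u / 1000000u);

    for (int i = 0; i < 6; i++)
        out[i] = (uint8_t)(ms >> (40 - 8 * i));
    out[6] = (uint8_t)(0x70 | (frac >> 8));
    out[7] = (uint8_t)(frac & 0xFF);
    out[8] = (uint8_t)(0x80 | (counter >> 8));
    out[9] = (uint8_t)(counter & 0xFF);
    memcpy(out + 10, rnd, 6);
}
```

With this arrangement a plain byte-wise comparison (memcmp, or a binary index in a database) orders two UUIDs first by millisecond, then by the fraction, then by the counter. The random tail only breaks ties that the clock and counter cannot. Twelve bits of fraction give a resolution of about 244 ns, so the counter covers calls that land inside the same step.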

On top of the ordering work there is an integer-millisecond API: uuid_v7_at(int $ms), Uuid::uuid7(int $ms), and getTimestampMillis(). These let you stamp a UUID at a known time, or read its time back, as a plain integer, without constructing a DateTime object on either side. The DateTime accessors that do exist read and write ext/date's internal timelib_time structure directly rather than routing through call_user_function, which makes them roughly three times cheaper.
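At the byte level, an integer-millisecond path is little more than a 48-bit big-endian write and read. The C helpers below are hypothetical stand-ins for whatever sits behind uuid_v7_at() and getTimestampMillis(); they only show why no date object is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Write a Unix-millisecond timestamp into the first six bytes of a v7 UUID. */
static void uuid_v7_set_ms(uint8_t uuid[16], uint64_t ms)
{
    assert(ms < (1ULL << 48));  /* the field is 48 bits wide */
    for (int i = 0; i < 6; i++)
        uuid[i] = (uint8_t)(ms >> (40 - 8 * i));
}

/* Read the timestamp back as a plain integer. */
static uint64_t uuid_v7_get_ms(const uint8_t uuid[16])
{
    uint64_t ms = 0;
    for (int i = 0; i < 6; i++)
        ms = (ms << 8) | uuid[i];
    return ms;
}
```

Six shifts in each direction, no allocation, no timezone handling: that is the work a DateTime round trip adds overhead to.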

Conversations with Ben Ramsey sharpened some of these choices. He was generous with feedback while I was building this, and he shared where he's taking identifiers next with ramsey/identifier, a newer library spanning UUIDs, ULIDs, and Snowflake IDs that points at the direction he sees superseding ramsey/uuid over time. That nudged me toward future-proofing the API surface: alongside the get* method names every ramsey/uuid user knows, fast_uuid also ships to* aliases, so code written today reads the same whichever way the wider convention settles. Credit where it is due. ramsey/uuid set the bar I was building against, and Ben pointing at what comes after it shaped the parts of this extension meant to outlive the current convention.

The numbers

Throughput against ramsey/uuid 4.9.2 and the PECL uuid extension 1.3.0, on PHP 8.4.22 NTS, non-debug, no sanitizers, with the SSSE3 hex formatter active on x86-64. Each operation runs 300,000 iterations after a 20,000-iteration warmup, and the reported figure is the best of 40 runs. Units are million operations per second, higher is better.

| Operation | fast_uuid (obj) | fast_uuid (proc) | ramsey/uuid | PECL uuid |
|---|---|---|---|---|
| v4 gen to string | 12.6 | 19.5 | 1.10 | 0.47 |
| v1 gen to string | 12.3 | 16.5 | 0.29 | 8.22 |
| v7 gen to string | 12.1 | 19.8 | 0.66 | n/a |
| parse to 16 bytes | 10.4 | 16.2 | 3.18 | 5.28 |

Against ramsey/uuid that is roughly 11.5x to 17.7x on v4, 42x to 57x on v1, 18.3x to 30x on v7, and 3.3x to 5.1x on parsing. The v1 gap is the widest because v1 is ramsey/uuid's slowest path and close to fast_uuid's fastest. PECL is far faster than ramsey/uuid on v1 (8.22 vs 0.29); v1 is PECL's strongest row, and fast_uuid still clears it. On v4, PECL drops to 0.47 because it asks the kernel for entropy on every call.

One honest caveat that I keep in BENCHMARKS.md and will not drop here: the fast_uuid operations are fast enough (around 50 ns) that scheduler noise dominates a single run, so read the fast_uuid columns as approximate rather than good to three significant figures. They move by roughly plus or minus 10 percent from run to run. The ramsey/uuid (around 900 ns) and PECL (around 2 microseconds) columns reproduce to within about 3 percent. If you cite these numbers, cite them with the comparison set and build: vs ramsey/uuid 4.9.2 and PECL uuid 1.3.0, PHP 8.4.22 NTS non-debug, best of 40 runs. ARM64 NEON numbers and the full timestamp-API table are in the repo.
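The per-call latencies and the multipliers quoted above follow directly from the table. A small C sketch of the arithmetic, with hypothetical helper names: a throughput of N million operations per second is 1000 / N nanoseconds per operation, and a speedup is just the ratio of two throughputs.

```c
#include <assert.h>

/* Convert throughput in million ops/s to nanoseconds per operation. */
static double ns_per_op(double mops)
{
    assert(mops > 0.0);
    return 1000.0 / mops;
}

/* How many times faster `fast` is than `slow`, both in million ops/s. */
static double speedup(double fast, double slow)
{
    assert(slow > 0.0);
    return fast / slow;
}
```

Plugging in the v4 row: 19.5 M ops/s is about 51 ns per call for the procedural API, 1.10 is about 909 ns for ramsey/uuid, and 0.47 is about 2.1 microseconds for PECL. On the v1 row, speedup(16.5, 0.29) comes out near 57, the top of the 42x to 57x range.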

For a byte-layout-heavy extension, correctness matters more than throughput. fast_uuid builds green on PHP 8.1 through 8.6, NTS and ZTS, with zero compiler warnings, and runs clean under AddressSanitizer and UndefinedBehaviorSanitizer across five build configurations.

What it costs you

No optimization is free, and hiding the costs would defeat the point of writing this honestly.

uuid_v4_fast() uses xoshiro256**, which is fast and statistically good but not cryptographically secure. Use it for keys and trace IDs where unpredictability is not a security requirement. Never use it for session tokens, password-reset nonces, or anything an attacker benefits from guessing. The kernel-backed uuid_v4() is right there for those.

The ramsey/uuid compatibility layer, FastUuid\Compat, is a PSR-4 companion package that makes migration largely a matter of swapping use statements. It is not on Packagist yet, so today you install it as a Composer path repository rather than with a plain composer require. Adoption is a migration, not a drop-in binary swap.

If you supply a custom RandomGeneratorInterface, TimeGeneratorInterface, or NodeProviderInterface, generation intentionally routes off the C fast path, the same way ramsey/uuid lets you override its internals. Your generator wins, and you give up the speedup for those calls. That is the correct trade, but it is a trade.

And getDateTime() reads v7 timestamps back at millisecond precision, matching ramsey/uuid, even though the extension carries sub-millisecond data internally for ordering. The sub-ms fraction exists to keep same-millisecond UUIDs sorted, not to hand you back a nanosecond clock.

Getting it

The extension is BSD-3-Clause, builds on PHP 8.1 through 8.6, and is PIE-installable:

pie install iliaal/fast_uuid

Prebuilt binaries cover Windows x86/x64 (NTS and TS), Linux glibc x86_64 and arm64, and macOS arm64. Source, benchmarks, and the compatibility layer are at github.com/iliaal/fast_uuid.

UUID generation is the kind of cost you never notice until you're doing it a million times an hour, and then it's the kind you can't un-see. fast_uuid gives most of that cost back without asking you to relearn an API or drop a single RFC 9562 version, and it gets a little faster every release.