- security
- engineering
- pre-prod
What we found before shipping — a pre-prod cleanup story
Most vendors say they harden it before launch. We wrote the list. Nine dev shortcuts caught and filed before the customer plane went near production traffic, plus the rule we used to find them.
26 Apr 2026 · Mark Bakker
Every identity vendor says it. "We harden it before launch." It's a sentence on every security-review page; nobody pushes back; the buyer files it under "they say so." It's not a verifiable claim.
Last week, while landing the customer-plane work, we wrote the list. Nine specific things we found in our own codebase, each one tracked, each one assigned. None of them were vulnerabilities in the formal CVE sense — no exploitable bug, no advisory — but every one of them was the kind of dev shortcut that turns into a production fire two weeks after launch. We caught them all because we went looking.
This post is what we found, and the rule we used to find it.
The rule: bearer everywhere or there's an ADR
The trigger was a code review on the trustgate webhook — an unauthenticated endpoint carrying a months-old TODO: Add HMAC comment — and a half-finished HMAC dispatcher on the wallet-broker side pointed at a hardcoded default key. Neither side actually verified anything. Both pieces had been written, both had been reviewed, and neither would have done its job if pointed at production.
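For contrast, correct HMAC verification is only a few lines — which is what makes a half-finished dispatcher notable. A hedged sketch; the function name, header encoding, and hex format are assumptions, not the real wire format:

```python
import hmac
import hashlib

def verify_webhook_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Verify a hex-encoded HMAC-SHA256 webhook signature.

    Illustrative only: the real endpoint's header name and encoding
    may differ. The key point is that a hardcoded default key defeats
    this entirely — anyone who knows the default can mint valid
    signatures.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing
    return hmac.compare_digest(expected, signature_header)
```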
The fix was easy: bearer + scope, end-to-end, the same way the rest of the customer plane authenticates. But the fix surfaced a question: how many other places in the repo had we taken the "we'll harden it later" shortcut?
We wrote it down as a rule:
Every customer-plane service authenticates via OAuth bearer + scope minted by the authorization hub. Static admin keys, per-tenant API keys, HMAC signing keys, and any other ad-hoc auth scheme that bypasses the hub are pre-prod artefacts and are scheduled for removal. Deviations require an ADR with explicit rationale.
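The enforcement side of that rule is deliberately small. A minimal sketch, assuming the bearer token has already been validated upstream (signature, expiry, issuer) and decoded into a claims dict — the function and claim names are illustrative, not our actual middleware:

```python
def require_scope(claims: dict, required: str) -> None:
    """Reject a request whose validated token lacks the required scope.

    Assumes OAuth-style space-delimited scopes in the "scope" claim;
    names are hypothetical, for illustration of the rule's shape.
    """
    granted = set(claims.get("scope", "").split())
    if required not in granted:
        raise PermissionError(f"missing scope: {required}")
```

The point of a single enforcement shape is that every ad-hoc scheme (admin header, per-tenant key, HMAC key) becomes a visible deviation instead of a parallel code path.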
That rule, applied to the repo, turned a security review into a punch list.
What the scan found
Three security migrations were direct consequences of the rule. The trustgate webhook was the trigger, but trust-registry's admin endpoints had the same shape — a static X-Admin-Key header with a hardcoded dev default — and wallet-broker's tenant-facing API used per-tenant API keys instead of OAuth bearer tokens. Three services, three migrations to the same auth model. All three are merged. None of them shipped a deprecation window — pre-prod is the right time to flip cleanly.
But the broader scan turned up six more items that aren't strictly auth problems but share the same "works in dev, fails silently in prod" quality:
Plaintext client secrets in seed migrations. Several seed migrations had clients with demo-secret-style passwords stored verbatim. Fine for local dev. If those migrations run in production, those clients exist in prod with secrets that anyone with repo access can read.
Dev-key defaults that warn but don't enforce. Four services had configuration of the form "use the env var, or fall back to a well-known dev key." If the operator forgets to set the env var, the service runs with the well-known dev key. That's a classic "everything's fine until it's not" trap — most of the time the env var is set; one Tuesday someone forgets.
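The fail-fast alternative costs one function. A sketch with hypothetical names — the variable name and message are illustrative:

```python
import os

# Bad: runs fine until the one Tuesday someone forgets the env var.
#   signing_key = os.environ.get("SIGNING_KEY", "dev-key-123")

def require_secret(name: str) -> str:
    """Fail fast at startup instead of falling back to a dev default."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; refusing to start with a dev default")
    return value
```

A crash at boot is loud and immediate; a well-known dev key in production is silent until someone finds it.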
In-memory stubs masquerading as production code. An in-process store for RFC 9126 PAR requests. An in-process per-IP rate-limit window. A no-op TLS provisioning service that skips the work with a warning log. Each works fine on a laptop and fails silently the moment you scale past one pod, because the second pod doesn't see the first pod's state.
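The rate-limit failure mode is easy to demonstrate with a toy. This is an illustrative fixed-window limiter, not our actual implementation — the class and method names are made up:

```python
import time
from collections import defaultdict

class InProcessRateLimiter:
    """Per-IP fixed-window limiter held in process memory.

    Works on a laptop. Behind a load balancer with N pods, each pod
    keeps its own counters, so the effective limit silently becomes
    N times the configured limit.
    """
    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (ip, window bucket) -> count

    def allow(self, ip: str, now=None) -> bool:
        now = time.time() if now is None else now
        bucket = (ip, int(now // self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit
```

Two instances of this class are two pods: each happily grants the full limit, and no log line tells you.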
Cloud-wallet local AES encryption with a hardcoded dev key. The fix is Vault Transit envelope encryption — the same module the JWT signing already goes through. Same shape as the OATH MFA seed-encryption work that landed earlier.
Deprecated Vault extension functions without a removal target. Deprecation annotations with no removal date and no migration target are technical debt that compounds. Either they get moved to where the comment says they belong, or they get deleted, or they get un-deprecated with a written reason.
OWASP Dependency Check disabled — not because we turned it off, but because it returns 429 against NIST's NVD without an API key. The scanner runs but produces no CVE coverage. Until the API key is wired up, our CVE matching is effectively off.
The cache classification rule
Halfway through, a reviewer asked a sharper question: "why is the new per-pod token cache using an in-memory store when our coding conventions say caches must be multi-node-safe?" The answer is real but it doesn't fit on a checkbox: per-pod is fine when each pod independently produces a correct result. The hub doesn't have a per-client mint cap; if pod A and pod B each mint a token, both are valid. Token caches are not split-brain — they're an optimization.
But that distinction needs to live somewhere durable. So we ran an audit: every in-process cache in the codebase gets classified as MUST move to Redis (correctness depends on cluster consistency — rate limits, nonces, session tokens, JWKS rotation), per-pod is OK (token caches, memoized pure functions), or REMOVE (dead). That table goes into an ADR; the rule goes into our coding conventions; a CI lint flags new in-process cross-request caches and asks the author to classify them.
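One way to make that classification machine-checkable is to force every in-process cache through a registry the lint can read. A sketch — the decorator, registry, and enum names are hypothetical, not our actual tooling:

```python
from enum import Enum
from functools import lru_cache

class CacheClass(Enum):
    MUST_MOVE_TO_REDIS = "correctness depends on cluster consistency"
    PER_POD_OK = "each pod independently produces a correct result"
    REMOVE = "dead"

CACHE_REGISTRY: dict = {}

def classified_cache(name: str, cls: CacheClass, maxsize: int = 128):
    """Wrap lru_cache so every in-process cache declares its class.

    A CI check can then fail on any in-process cross-request cache
    that isn't in CACHE_REGISTRY, asking the author to classify it.
    """
    def wrap(fn):
        CACHE_REGISTRY[name] = cls
        return lru_cache(maxsize=maxsize)(fn)
    return wrap

@classified_cache("hub-token-cache", CacheClass.PER_POD_OK)
def mint_token(client_id: str) -> str:
    # per-pod is fine here: if pod A and pod B each mint, both are valid
    return f"token-for-{client_id}"
```

The classification is then code, not tribal knowledge, and the reviewer's question answers itself at the call site.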
The point is not that the per-pod judgement is wrong. The point is that the next reviewer should be able to read the rule and know — at review time — which class their cache belongs to.
What we didn't do
This is not a security-incident post. None of the items above was exploitable; the customer plane had not yet been deployed. The list is what would have been a problem if we'd shipped without looking.
We did not write a recovery plan, because we have nothing to recover. We did not rotate any secrets, because no production secrets existed. We did not call vendors, because nobody else's code is involved. The whole list is internal.
What we did do is write down all nine items, group them under one launch-readiness checklist, attach CI guards to each pattern (no new plaintext seed secrets, no new dev-key defaults, no new unclassified cross-request in-process cache), and decide that the customer plane doesn't go to production until the checklist is green.
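A guard of that shape can be a few lines of script in CI. A sketch — the patterns are illustrative stand-ins for the real ones, and a production version would want per-pattern allowlists:

```python
import re
from pathlib import Path

# Illustrative patterns; tune to your repo's actual conventions.
SUSPECT = [
    re.compile(r"demo-secret", re.IGNORECASE),   # plaintext seed secrets
    re.compile(r"getenv\([^)]*,\s*['\"]"),       # env var with a literal dev fallback
]

def scan_for_preprod_artefacts(root: Path) -> list:
    """Return file:line hits for patterns that shouldn't reach production."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        lines = path.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, 1):
            if any(p.search(line) for p in SUSPECT):
                hits.append(f"{path}:{lineno}")
    return hits
```

Exit non-zero on any hit and the checklist stops being a document someone has to remember to re-read.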
The bar isn't "we'll harden it before launch." The bar is "the launch-readiness checklist is this list."
If you're evaluating a vendor, that's a more interesting question to ask than the marketing page version. "Do you have a list?" is a different conversation than "is your platform secure?". If the answer to the first one is no, the answer to the second one isn't going to be useful.