They eliminated coding, not developers
StrongDM's 3-person team builds production security software with zero human code.
Gems of the Week
Official Google Workspace CLI — built for both humans and agents, covering Drive, Gmail, Calendar, and all Workspace APIs. Ships with 40+ pre-built agent skills, making it the reference implementation for CLI-first agentic tooling on a major enterprise platform.
shadcn/cli v4 — Added a full skills system this week. Components are now agent-composable units, not just clipboard copy/paste. The right abstraction for agent-first frontend work.
T3 Code — Desktop app for AI coding agents. GUI layer for agentic workflows, more opinionated than Frontier terminal. It's heavily inspired by Codex. Just install it and see if it fits your flow.
Five Levels of AI-assisted programming
Dan Shapiro built a five-level taxonomy of AI-assisted programming.
Level 0: IDE AutoComplete
Level 1: Discrete tasks to your AI intern. "Write a unit test for this." "Add a docstring."
Level 2: Junior developer harness — a standard code agent with no setup or methodology behind it.
Level 3: Developer harness — You are the human in the loop, reviewing and directing: what the AI should do, what is breaking, which direction to take, sometimes fighting it.
Level 4: Engineering Team — You push the harness hard with lots of skills, custom tools, and MCP servers, and you automate the feedback loops. Your role is to write specifications, test plans, and quality assurance.
Level 5: The Dark Factory — No human in the loop is needed anymore. The human only configures and tunes the agentic system; the software factory runs on its own, including QA and test evaluation.
StrongDM AI Native SWE
Three people build production security software. No human writes or reviews code.
StrongDM is Level 5. For security infrastructure.
Justin McCarthy, Jay Taylor, and Navan Chauhan at StrongDM started this experiment in July 2025. Working prototypes by October. Three months.
StrongDM builds access management and security infrastructure — the kind of software that ends companies when it fails. It has to be reliable and secure by design to survive.
They set two rules:
- Code must not be written by humans.
- Code must not be reviewed by humans.
Budget: $1,000/day in tokens per engineer.
How it actually works
The pipeline: specs + scenarios → agents → harnesses → convergence.
"Convergence" means agents run against a Digital Twin Universe — independent clones of production SaaS (Slack, Okta) without rate limits. Swarms of simulated test agents execute scenarios continuously. They don't check pass/fail. They measure probabilistic satisfaction: a score, a confidence interval, a threshold the team decided the spec demands.
Deciding what "good enough" looks like for a feature is the only human work left: writing the satisfaction criteria precisely enough that the metric captures what actually matters. That's the spec. The code is a side effect.
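StrongDM hasn't published its exact metric, so treat this as a minimal sketch of the idea only: a satisfaction gate that passes a spec when the lower confidence bound of the scenario pass rate clears a threshold. The function names and the choice of a Wilson score interval are my assumptions, not theirs.

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a pass rate (95% by default)."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - margin) / denom

def spec_satisfied(results: list[bool], threshold: float) -> bool:
    """Pass only if the *lower* confidence bound clears the threshold,
    so a lucky streak on few trials cannot converge early."""
    return wilson_lower_bound(sum(results), len(results)) >= threshold

# 480 passing scenarios out of 500: lower bound ~0.94, clears a 0.90 threshold.
print(spec_satisfied([True] * 480 + [False] * 20, threshold=0.90))
# 9 of 10 passing is not enough evidence for the same threshold.
print(spec_satisfied([True] * 9 + [False], threshold=0.90))
```

The design point is the lower bound: it forces the swarm to accumulate enough scenario runs before "good enough" can be declared.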
Simon Willison noted what many developers acknowledge: something changed in November 2025. Claude Opus 4.5 and GPT 5.2, along with code-agent reliability, crossed a threshold. StrongDM had been building toward it since July. They were ready to push further.
StrongDM has defined their Dark Software Factory in a set of Markdown rules that you can use to replicate their agentic setup. In all fairness, I've looked at it and found it's more a sloppy guide to agentic harness setup than a proper, efficient software factory.
I advocate for a specific agentic setup for each project, because stacks, practices, and feedback-loop testing differ from project to project. You need to enforce specific practices, and that requires care and attention. I'm experienced in test automation — I've done it deeply for more than 10 years with infrastructure as code, ephemeral infra, and CI/CD. AI engineering was just a natural evolution of that automation for me.
Where I am
I'm at Level 4. Mostly.
For pure HTML/CSS work: Level 5. The surface area is bounded, the security risk is zero, the model's attention stays focused. I don't touch the output.
For anything touching infrastructure, secrets, or cross-service logic: Level 4. I enforce strict Skills and Rules in my agentic harness. One shot is never enough — I run many steps, steering the model across phases. I cross-review architecture decisions with Codex GPT 5.4, but the daily driver is Claude Code Opus 4.6.
The problem nobody names correctly
I've had many rounds where the model didn't test the features it was specifically instructed to test. Not because the instruction was missing. Because it was buried — too many loops, a complex task half-completed, attention stretched across too many concerns at once.
This is attention dilution during prefill. When a prompt crosses too many expertise domains simultaneously, the model can't weight each one with depth. You get coverage, not precision. You get "tested" in a log comment and a test that ran but didn't assert anything that would have caught the regression.
StrongDMâs architecture handles this by constraint: the spec is the only instruction. Everything else is generated. A bounded domain means focused weights. If the spec is precise, the model stays sharp.
The AWS story
Someone on X last week: Claude deleted their entire AWS RDS database through a Terraform deployment.
The replies were predictable. "AI is not ready." "This is why you can't trust these models."
Wrong diagnosis.
The problem isn't Claude. It's the absence of hooks. Claude Code lets you define hooks that intercept and deny specific commands — terraform destroy on production, anything irreversible without explicit approval. The developer didn't configure them. The model did exactly what the instructions asked, inside an environment with no guardrails.
The spec was missing. The harness was missing. The model executed correctly against an incomplete set of constraints. Thatâs not an AI failure. Thatâs a system design failure.
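As a sketch of what such a guardrail looks like: a deny-list check of the kind a Claude Code PreToolUse hook could run before a shell command executes. The patterns are illustrative, and the actual hook wiring (JSON on stdin, blocking via exit code) is omitted here.

```python
import re

# Commands no agent may run unattended. Illustrative list, not a complete policy.
DENY_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\baws\s+rds\s+delete-db-instance\b",
    r"\brm\s+-rf\s+/",
]

def guard(command: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a shell command the agent wants to run.
    In a real hook, a disallowed command would be denied before execution."""
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command):
            return False, f"blocked: matched {pattern}"
    return True, "allowed"

print(guard("terraform destroy -auto-approve"))  # blocked
print(guard("terraform plan"))                   # allowed
```

Twenty lines of constraint would have saved that RDS database. That's the asymmetry the "AI is not ready" crowd misses.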
The spec is the moat
You're not being replaced by AI. You're being replaced by someone who writes better specs.
StrongDM's bet isn't that the models are reliable enough to run unsupervised. It's that the specifications are precise enough to make unsupervised execution safe. That distinction is everything. The first is a model problem. The second is a skills problem.
Stanford Law's CodeX group put it plainly: "A team building security infrastructure decided human code review is an obstacle, not a safeguard." The accountability question is real — legal frameworks aren't prepared for probabilistic satisfaction metrics instead of deterministic correctness. But the answer to that gap isn't to put humans back on code review. It's to make the spec sophisticated enough that accountability is traceable.
The spec is where accountability lives now.
If you want to move from Level 3 to Level 4 — and eventually Level 5 on the tasks where that's safe — the bottleneck isn't the model. It's your ability to write instructions that survive a thousand agent iterations without drifting.
That's what I'm putting in the DevX Course: agentic setup, system design, specifications, and security.
— Pierre
AI writes the code. You design the system.

