They eliminated coding, not developers
StrongDM's 3-person team builds production security software with zero human code.
Gems of the Week
Official Google Workspace CLI — built for both humans and agents, covering Drive, Gmail, Calendar, and all Workspace APIs. Ships with 40+ pre-built agent skills, making it the reference implementation for CLI-first agentic tooling on a major enterprise platform.
shadcn/cli v4 — Added a full skills system this week. Components are now agent-composable units, not just clipboard copy/paste. The right abstraction for agent-first frontend work.
T3 Code — Desktop app for AI coding agents. GUI layer for agentic workflows, more opinionated than Frontier terminal. It's heavily inspired by Codex. Just install it and see if it fits your flow.
Five Levels of AI-assisted programming
Dan Shapiro built a five-level taxonomy of AI-assisted programming.
Level 0: IDE AutoComplete
Level 1: Discrete tasks to your AI intern. "Write a unit test for this." "Add a docstring."
Level 2: Junior developer harness — a standard code agent with no setup or methodology behind it.
Level 3: Developer harness — You are the human in the loop, reviewing and directing: what the AI should do, what is breaking, which direction to take, sometimes fighting it.
Level 4: Engineering Team — You push the harness hard with lots of skills, custom tools, and MCP servers, and you automate the feedback loops. Your role is to write specifications, test plans, and quality assurance.
Level 5: The Dark Factory — No human in the loop is needed anymore. The human only configures and tunes the agentic system; the software factory runs on its own, including QA and test evaluation.
StrongDM AI Native SWE
Three people build production security software. No human writes or reviews code.
StrongDM is Level 5. For security infrastructure.
Justin McCarthy, Jay Taylor, and Navan Chauhan at StrongDM started this experiment in July 2025. Working prototypes by October. Three months.
StrongDM builds access management and security infrastructure — the kind of software that ends companies when it fails. It has to be reliable and secure by design to survive.
They set two rules:
- Code must not be written by humans.
- Code must not be reviewed by humans.
Budget: $1,000/day in tokens per engineer.
How it actually works
The pipeline: specs + scenarios → agents → harnesses → convergence.
"Convergence" means agents run against a Digital Twin Universe — independent clones of production SaaS (Slack, Okta) without rate limits. Swarms of simulated test agents execute scenarios continuously. They don't check pass/fail. They measure probabilistic satisfaction: a score, a confidence interval, a threshold the team decided the spec demands.
Deciding what "good enough" looks like for a feature is the only human work left: writing the satisfaction criteria precisely enough that the metric captures what actually matters. That's the spec. The code is a side effect.
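StrongDM hasn't published its exact metric, so treat this as a minimal sketch of the idea only: a satisfaction gate that passes a spec when the lower confidence bound of the scenario pass rate clears a threshold. The function names and the choice of a Wilson score interval are my assumptions, not theirs.

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a pass rate (95% by default)."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - margin) / denom

def spec_satisfied(results: list[bool], threshold: float) -> bool:
    """Pass only if the *lower* confidence bound clears the threshold,
    so a lucky streak on few trials cannot converge early."""
    return wilson_lower_bound(sum(results), len(results)) >= threshold

# 480 passing scenarios out of 500: lower bound ~0.94, clears a 0.90 threshold.
print(spec_satisfied([True] * 480 + [False] * 20, threshold=0.90))
# 9 of 10 passing is not enough evidence for the same threshold.
print(spec_satisfied([True] * 9 + [False], threshold=0.90))
```

The design point is the lower bound: it forces the swarm to accumulate enough scenario runs before "good enough" can be declared.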
Simon Willison noted what many developers acknowledge: something changed in November 2025. Claude Opus 4.5 and GPT 5.2, along with code-agent reliability, crossed a threshold. StrongDM had been building toward it since July. They were ready to push further.
StrongDM has defined their Dark Software Factory in a set of Markdown rules that you can use to replicate their agentic setup. In all fairness, I've looked at it and found it's more a sloppy guide to agentic harness setup than a proper, efficient software factory.
I advocate for a specific agentic setup for each project, because stacks, practices, and feedback-loop testing differ from project to project. You need to enforce specific practices, and that requires care and attention. I'm experienced in test automation — I've done it deeply for more than 10 years with infrastructure as code, ephemeral infra, and CI/CD. AI engineering was just a natural evolution of that automation for me.
Where I am
I'm at Level 4. Mostly.
For pure HTML/CSS work: Level 5. The surface area is bounded, the security risk is zero, the model's attention stays focused. I don't touch the output.
For anything touching infrastructure, secrets, or cross-service logic: Level 4. I enforce strict Skills and Rules in my agentic harness. One shot is never enough — I run many steps, steering the model across phases. I cross-review architecture decisions with Codex GPT 5.4, but the daily driver is Claude Code Opus 4.6.
The problem nobody names correctly
I've had many rounds where the model didn't test the features it was specifically instructed to test. Not because the instruction was missing. Because it was buried — too many loops, a complex task half-completed, attention stretched across too many concerns at once.
This is attention dilution during prefill. When a prompt crosses too many expertise domains simultaneously, the model can't weight each one with depth. You get coverage, not precision. You get "tested" in a log comment and a test that ran but didn't assert anything that would have caught the regression.
StrongDMâs architecture handles this by constraint: the spec is the only instruction. Everything else is generated. A bounded domain means focused weights. If the spec is precise, the model stays sharp.
The AWS story
Someone on X last week: Claude deleted their entire AWS RDS database through a Terraform deployment.
The replies were predictable. "AI is not ready." "This is why you can't trust these models."
Wrong diagnosis.
The problem isn't Claude. It's the absence of hooks. Claude Code lets you define hooks that intercept and deny specific commands — terraform destroy on production, anything irreversible without explicit approval. The developer didn't configure them. The model did exactly what the instructions asked, inside an environment with no guardrails.
The spec was missing. The harness was missing. The model executed correctly against an incomplete set of constraints. Thatâs not an AI failure. Thatâs a system design failure.
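As a sketch of what such a guardrail looks like: a deny-list check of the kind a Claude Code PreToolUse hook could run before a shell command executes. The patterns are illustrative, and the actual hook wiring (JSON on stdin, blocking via exit code) is omitted here.

```python
import re

# Commands no agent may run unattended. Illustrative list, not a complete policy.
DENY_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\baws\s+rds\s+delete-db-instance\b",
    r"\brm\s+-rf\s+/",
]

def guard(command: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a shell command the agent wants to run.
    In a real hook, a disallowed command would be denied before execution."""
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command):
            return False, f"blocked: matched {pattern}"
    return True, "allowed"

print(guard("terraform destroy -auto-approve"))  # blocked
print(guard("terraform plan"))                   # allowed
```

Twenty lines of constraint would have saved that RDS database. That's the asymmetry the "AI is not ready" crowd misses.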
The spec is the moat
You're not being replaced by AI. You're being replaced by someone who writes better specs.
StrongDM's bet isn't that the models are reliable enough to run unsupervised. It's that the specifications are precise enough to make unsupervised execution safe. That distinction is everything. The first is a model problem. The second is a skills problem.
Stanford Law's CodeX group put it plainly: "A team building security infrastructure decided human code review is an obstacle, not a safeguard." The accountability question is real — legal frameworks aren't prepared for probabilistic satisfaction metrics instead of deterministic correctness. But the answer to that gap isn't to put humans back on code review. It's to make the spec sophisticated enough that accountability is traceable.
The spec is where accountability lives now.
If you want to move from Level 3 to Level 4 — and eventually Level 5 on the tasks where that's safe — the bottleneck isn't the model. It's your ability to write instructions that survive a thousand agent iterations without drifting.
That's what I'm putting in the DevX Course: agentic setup, system design, specifications, and security.
— Pierre
AI writes the code. You design the system.

