The Governance Layer
Traditional software says test before you ship. Behavioral agents don't work that way. Two months of production agents and what staying in control actually requires.

Software engineering encoded a rule so deeply we stopped calling it a rule: test before you ship. The CI/CD pipeline exists to enforce this. Gate checks on pull requests, automated test suites, staging environments built to approximate production before anything touches it. Deploy is the finish line. Once something clears those checkpoints, it runs in production and you trust it.
That intuition breaks with behavioral agents. Not partially. Completely.
I’ve been running two fine-tuned agents in production for two months. The foreman delegates, the worker retrieves, the pipeline executes. Infrastructure metrics are clean: sub-500ms latency, zero error rate, tool calls completing within budget. Behavioral eval results from training: 80 and 88 percent pass rates on their respective roles. By every pre-deployment measure, these were systems I understood before I shipped them.
Somewhere in week five, the worker resolved an ambiguity it was supposed to escalate. Logs: clean. Task: complete. Behavior: wrong.
No system prompt violation. No tool call anomaly. No error in any conventional sense. Just a judgment call the agent made in a situation where unilateral judgment calls are exactly what it’s trained to avoid. My infrastructure dashboard didn’t see it. I found it three hours later doing a manual work log review.
The software intuition would call this a QA failure. It isn’t. It’s a governance failure. The distinction matters more than it might look.
Why the Test Doesn’t Transfer
A software test runs the same function against the same input and expects the same output. Deterministic. Stateless. If the test passes at deploy time, it passes forever unless the code changes. The CI gate holds because the thing being tested doesn’t change without a deployment.
Behavioral fine-tuning doesn’t work this way. The model is the deployment. Its behavioral state isn’t fixed at training: context shifts it, inference conditions drift it, production inputs hit it in ways that didn’t exist in your test suite. The eval I built for each role has 250 test cases. 250 test cases can’t cover the input distribution of an agent running real tasks for six weeks.
80 percent in eval doesn’t mean a 20 percent failure rate in production. It means: of the specific behavioral patterns I thought to test, the agent satisfied 80 percent. Production brings inputs you didn’t think to test. The eval is a sample, not a proof.
Pre-deployment testing is sufficient for deterministic systems because you can enumerate the behaviors that matter. For a behavioral agent, you can’t enumerate them. You can sample them. And sampling is ongoing work, not a terminal gate.
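One way to make “a sample, not a proof” concrete is to put an interval around the eval score. What follows is a minimal sketch, assuming the eval cases behave like independent draws from the tested distribution, which is a generous assumption. The helper name wilson_interval and the two suite sizes are illustrative, not part of the eval framework described here.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Same 80 percent score, two hypothetical suite sizes: the smaller suite
# leaves a much wider range of plausible "true" pass rates.
for passes, total in [(20, 25), (200, 250)]:
    low, high = wilson_interval(passes, total)
    print(f"{passes}/{total}: plausible pass rate between {low:.2f} and {high:.2f}")
```

Even this interval only covers sampling noise over the patterns that were tested. It says nothing about inputs outside the suite, which is the larger problem.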
What Governing in Production Means
Once you accept that the eval framework is a production reference and not a one-time pass/fail, the question becomes what you do with it.
The minimum I can articulate from six weeks of running this:
Behavioral re-runs on a schedule. The eval cases for each role run against the live agent every week, not just after retraining. Stability is the signal: the score that was 80 percent at deploy should still be 80 percent four weeks later. A drop to 70 isn’t necessarily a crisis. It’s a signal you want before it compounds into something that is.
Output sampling. Reviewing every production output is too slow and too expensive. Ten percent of weekly outputs, scored against a behavioral rubric, gives trend data. Individual scores matter less than direction.
Escalation pattern tracking. This is the check the week five failure pushed me to add. The worker is trained to escalate ambiguous instructions, so I started watching how often it was actually escalating week over week. A worker that escalated 15 ambiguities in week one and 4 in week five isn’t doing less work. It’s suppressing signals. That pattern shows up before the behavioral tests catch it.
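The three checks above can be sketched as small functions. This is a rough illustration, not the setup described here: the function names, the 5-point tolerance, and the 50 percent escalation floor are all assumptions chosen for the example. Only the 80 and 70 percent scores, the ten percent sample, and the counts 15 and 4 come from the text.

```python
import random


def pass_rate_drop(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when the live eval score fell more than `tolerance` below the deploy-time score."""
    return (baseline - current) > tolerance


def sample_outputs(outputs: list[str], fraction: float = 0.10, seed: int = 0) -> list[str]:
    """Pick a random fraction of the week's outputs for manual rubric scoring."""
    if not outputs:
        return []
    k = max(1, round(len(outputs) * fraction))
    return random.Random(seed).sample(outputs, k)


def escalation_suppressed(weekly_counts: list[int], floor_ratio: float = 0.5) -> bool:
    """True when the latest week's escalation count is below floor_ratio of the first week's."""
    if len(weekly_counts) < 2 or weekly_counts[0] == 0:
        return False
    return weekly_counts[-1] < floor_ratio * weekly_counts[0]


# Hypothetical week of 200 outputs; the scores and counts mirror the text.
week_outputs = [f"output-{i}" for i in range(200)]
print("eval drift flagged:", pass_rate_drop(baseline=0.80, current=0.70))
print("outputs to review:", len(sample_outputs(week_outputs)))
print("escalation drop flagged:", escalation_suppressed([15, 4]))
```

Raw escalation counts ignore how much ambiguous work arrived that week. A fuller version would normalize by task volume, which this sketch does not attempt.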
None of this is automated in my setup yet. It’s manual reviews on a weekly cadence, and pattern-watching that takes real time and depends on me showing up to do it. (That’s the honest version. The aspirational one is a behavioral drift detector that surfaces the signal before I have to notice the feeling that something’s off. That detector isn’t running in production yet. Right now I’m running on discipline.)
The Accountability Question
Traditional software accountability has a structure: the PR author owns the change, code review catches errors before merge, QA verifies behavior before release, and the audit trail is in git history.
Agent accountability doesn’t have that structure yet.
When the worker resolved the ambiguity it should have escalated, accountability lived somewhere in the gap between “I trained it to escalate” and “I accepted this output.” I reviewed the work log, found the failure, logged it, decided one failure in a rare edge case wasn’t load-bearing enough to trigger a training cycle.
That decision is the governance. There’s no tooling designed to record it.
What “being in control” of an agent system means, operationally: you have a documented position on what the agent did. You accepted the output and can say why. You flagged it and can say why. Or you didn’t review it and can’t say anything. The third option is the accountability gap that most enterprise agent deployments are sitting in right now, whether they know it or not.
Infrastructure monitoring shows the agent is running. It doesn’t show you’re in control. Those are two different things.
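A documented position can be as small as one record per output. The sketch below is a hypothetical shape for that record, assuming nothing beyond the three states named above. Position, ReviewRecord, and accountability_gap are made-up names, and the task IDs are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Position(Enum):
    ACCEPTED = "accepted"
    FLAGGED = "flagged"
    UNREVIEWED = "unreviewed"


@dataclass(frozen=True)
class ReviewRecord:
    output_id: str
    position: Position
    rationale: str = ""

    def __post_init__(self) -> None:
        # Accepting or flagging without a stated reason is not a documented position.
        if self.position is not Position.UNREVIEWED and not self.rationale.strip():
            raise ValueError("accepted and flagged records need a rationale")


def accountability_gap(output_ids: list[str], records: list[ReviewRecord]) -> float:
    """Fraction of outputs with no accepted or flagged record on file."""
    if not output_ids:
        return 0.0
    reviewed = {r.output_id for r in records if r.position is not Position.UNREVIEWED}
    missing = [oid for oid in output_ids if oid not in reviewed]
    return len(missing) / len(output_ids)


records = [
    ReviewRecord("task-1", Position.ACCEPTED, "output matches the escalation policy"),
    ReviewRecord("task-2", Position.FLAGGED, "resolved an ambiguity instead of escalating"),
]
gap = accountability_gap(["task-1", "task-2", "task-3", "task-4"], records)
print(f"share of outputs with no documented position: {gap:.0%}")
```

The rationale check is the point of the design: a record that says “accepted” with no reason attached is closer to the third option than to the first.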

Where the Tooling Is
Infrastructure monitoring is mature. The behavioral observability layer isn’t.
Observability platforms in 2026 are largely designed for general LLM applications: chatbots, document Q&A, support automation. Latency, cost, completion, safety scoring. Some add thin relevance layers. Almost none are built for the specific problem of monitoring fine-tuned behavioral agents where the governance question is whether a trained property is still holding in production.
For my setup, the answer is manual work I’m hoping to eventually automate: weekly behavioral test re-runs, output sampling, escalation pattern tracking, work log reviews. Every piece of it depends on a human deciding to do it.
The gap isn’t philosophical. It’s a tooling problem with a specific shape: to automate behavioral drift detection, you need a production judge that scores outputs against behavioral rubrics reliably, at scale, without requiring a human to initiate every review. The eval framework exists. The judge that runs it continuously, without me, doesn’t.
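The shape of that missing piece is easy to write down even though the hard part is not. Here is a minimal sketch of the interface, assuming a judge is any callable that maps an output and a rubric to a score between 0 and 1. Rubric, Judge, score_outputs, and placeholder_judge are hypothetical names, and the keyword check is a stand-in, not a real judge.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Rubric:
    name: str
    description: str


# A judge maps (agent output, rubric) to a score in [0, 1].
Judge = Callable[[str, Rubric], float]


def score_outputs(outputs: Iterable[str], rubrics: list[Rubric], judge: Judge) -> dict[str, float]:
    """Mean judge score per rubric across a batch of production outputs."""
    totals = {r.name: 0.0 for r in rubrics}
    count = 0
    for text in outputs:
        count += 1
        for r in rubrics:
            totals[r.name] += judge(text, r)
    if count == 0:
        return {}
    return {name: total / count for name, total in totals.items()}


def placeholder_judge(text: str, rubric: Rubric) -> float:
    # Stand-in only: a real judge would be a model or a validated classifier.
    return 1.0 if "escalate" in text.lower() else 0.0


rubric = Rubric("escalates_ambiguity", "Escalates ambiguous instructions instead of resolving them")
batch = [
    "Escalate: the request could refer to either of two repositories.",
    "Picked the first repository and proceeded.",
]
print(score_outputs(batch, [rubric], placeholder_judge))
```

Everything that makes the problem hard lives inside the judge callable: agreeing with human reviewers, staying stable over time, and running without someone starting it. The loop around it is the easy part.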
Infrastructure is running.
The governance layer is running on discipline.
Discipline doesn’t scale.

