Senior SRE • AI Reliability • Open Source
Creator of the OpenSRM ecosystem
Reliability engineering and AI are on a collision course, and both sides need each other.
Traditional SRE gave us SLOs, error budgets, and deployment gates for deterministic systems. But AI agents make decisions that can't be validated with unit tests. A code review bot with 99.9% availability can still approve PRs with critical security vulnerabilities dozens of times a day, and nobody's tracking that.
Meanwhile, AI systems are being deployed into production without the reliability practices that every other critical service takes for granted. No SLO contracts. No dependency math. No deployment gates based on decision quality.
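The availability-versus-judgment gap can be shown with a toy calculation. All numbers below are invented for illustration, and `slo_compliance` is a hypothetical helper, not part of any tool described here:

```python
# Illustrative numbers only: the point is that an availability SLO and a
# judgment SLO for the same service can disagree.

def slo_compliance(good: int, total: int) -> float:
    """Fraction of sampled outcomes that met the objective."""
    return good / total if total else 1.0

# 100 failed requests out of 100,000: the review bot is 99.9% available.
availability = slo_compliance(99_900, 100_000)

# But of 500 human-audited approvals, 40 waved through real vulnerabilities.
judgment = slo_compliance(460, 500)

# The availability SLO (99.9%) is met while a 95% judgment SLO is breached.
availability_met = availability >= 0.999
judgment_met = judgment >= 0.95
```

Nothing in the availability number hints at the judgment breach, which is why the two need separate SLOs.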
I'm building the tooling that connects these two worlds: bringing SRE discipline to AI systems, and extending SRE for the judgment-quality problems that AI introduces.
I wrote about this: Your AI Agent Is Available, Fast, and Making Terrible Decisions (judgment SLOs), OpenSRM: An Open Specification for Service Reliability, and Shift-Left Reliability.
OpenSRM is an open specification for declaring service reliability requirements as code, including judgment SLOs for AI decision quality. The spec is the shared contract that every component reads.
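For illustration only, a manifest under this model might look like the sketch below. The field names are invented, not the actual OpenSRM schema:

```yaml
# Hypothetical OpenSRM-style manifest; field names are illustrative,
# not the real spec.
service: pr-review-bot
slos:
  - name: availability
    objective: 99.9
    window: 30d
  - name: judgment-approval-precision   # judgment SLO: decision quality
    objective: 95.0
    window: 30d
    signal: human_audit_verdicts
gates:
  deploy:
    require: [availability, judgment-approval-precision]
```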
Five independent tools compose through the spec without depending on each other:
NthLayer – Generate your entire monitoring stack from a single YAML manifest. Prometheus rules, Grafana dashboards, PagerDuty configs, deployment gates. Deterministic, no AI required.
Arbiter – Universal quality measurement for AI agent output. Point it at your agents and it tells you which ones are producing good work and which are silently degrading. Tracks per-agent quality trends, self-calibrates through human correction signals, and governs agent autonomy.
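A minimal sketch of the trend-tracking idea, assuming an exponentially weighted moving average over 0-to-1 quality scores. The class and threshold are illustrative, not Arbiter's actual internals:

```python
# Hypothetical per-agent quality trend: smooth noisy scores, alert when
# the smoothed trend crosses a floor. Names and defaults are invented.

class QualityTrend:
    """EWMA of an agent's quality scores with a degradation floor."""

    def __init__(self, alpha: float = 0.2, floor: float = 0.8):
        self.alpha = alpha   # weight given to the newest score
        self.floor = floor   # alert when the trend drops below this
        self.ewma = None

    def observe(self, score: float) -> bool:
        """Record a 0..1 score; return True once the agent has degraded."""
        if self.ewma is None:
            self.ewma = score
        else:
            self.ewma = self.alpha * score + (1 - self.alpha) * self.ewma
        return self.ewma < self.floor

trend = QualityTrend()
degraded = False
for s in [0.95, 0.93, 0.9, 0.7, 0.6, 0.55]:  # an agent silently degrading
    degraded = trend.observe(s)
```

Smoothing matters here: a single bad score should not page anyone, but a sustained slide should.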
SitRep – Situational awareness at enterprise scale. Pre-correlates millions of observability signals continuously so that when something breaks, the correlated picture is already built. Seconds to a situation snapshot, not minutes of manual dashboard correlation.
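The pre-correlation idea can be sketched as indexing signals on arrival so the incident-time lookup is a plain read rather than an on-demand scan. Names here are illustrative, not SitRep's design:

```python
# Hypothetical continuous pre-correlation: group signals as they arrive,
# keyed by service, so a snapshot at incident time is a dict lookup.

from collections import defaultdict

class PreCorrelator:
    def __init__(self):
        self.by_service = defaultdict(list)

    def ingest(self, signal: dict) -> None:
        """Continuous path: correlate each signal when it arrives."""
        self.by_service[signal["service"]].append(signal)

    def snapshot(self, service: str) -> list:
        """Incident path: the correlated picture is already built."""
        return self.by_service[service]

pc = PreCorrelator()
pc.ingest({"service": "api", "kind": "metric", "msg": "p99 latency up"})
pc.ingest({"service": "api", "kind": "log", "msg": "db timeouts"})
pc.ingest({"service": "web", "kind": "metric", "msg": "error rate up"})
picture = pc.snapshot("api")  # two correlated signals, no scanning
```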
Mayday – AI-coordinated incident response. Specialised agents handle triage, investigation, communication, and remediation under human supervision. Findings flow back into the ecosystem so the system learns from every incident.
Each tool works alone. Together they form a complete reliability lifecycle: define (OpenSRM) → generate (NthLayer) → measure (Arbiter) → correlate (SitRep) → respond (Mayday) → learn (back to OpenSRM).
The ecosystem's quality measurement concepts were proven inside GasTown (Steve Yegge's multi-agent workspace manager):
- Guardian – Quality-review layer for the internal merge pipeline, implemented as a Deacon plugin. Scores per-worker output, tracks quality trends, alerts on degradation. Fix-merged to main by Steve Yegge.
- Feed problems view refactor – Replaced tmux scraping with structured beads-based health detection. Merged via dual-model review (5 iteration passes).
Zero Framework Cognition (ZFC): Transport is code. Judgment is model. Code handles deterministic transformation. The model handles interpretation. Every component in the ecosystem follows this boundary.
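The boundary can be made concrete with a minimal sketch: deterministic parsing stays in code, and the single model call is the only place interpretation happens. `call_model` is a stand-in for any model client, not a real API:

```python
# Hypothetical ZFC boundary: transport (code) vs judgment (model).

import json

def transport(raw_alert: str) -> dict:
    """Code side: deterministic parsing and normalisation only."""
    alert = json.loads(raw_alert)
    return {"service": alert["service"], "symptom": alert["symptom"]}

def judge(alert: dict, call_model) -> str:
    """Model side: the one place where interpretation happens."""
    prompt = f"Classify severity for {alert['service']}: {alert['symptom']}"
    return call_model(prompt)

normalized = transport('{"service": "pr-review-bot", "symptom": "latency spike"}')
severity = judge(normalized, call_model=lambda prompt: "sev2")  # stubbed model
```

Keeping the model call behind one seam means everything else stays unit-testable.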
The spec is the integration layer: Components don't import each other's code. They all read OpenSRM manifests and emit OTel telemetry following shared semantic conventions. Adopt one tool or all five.
Independence is the feature: Unlike platforms where you must adopt everything to get value, each component solves a complete problem alone.
- OpenSRM Ecosystem: github.com/rsionnach/opensrm
- Articles: Judgment SLOs • OpenSRM Spec • Shift-Left Reliability
- LinkedIn: rob-fox-29a29024
- GasTown Discord: Active contributor in the Wasteland community



