rsionnach/README.md

Hi, I'm Rob 👋

Senior SRE • AI Reliability • Open Source
Creator of the OpenSRM ecosystem


💡 The Thesis

Reliability engineering and AI are on a collision course, and both sides need each other.

Traditional SRE gave us SLOs, error budgets, and deployment gates for deterministic systems. But AI agents make decisions that can't be validated with unit tests. A code review bot with 99.9% availability can still approve PRs with critical security vulnerabilities dozens of times a day, and nobody's tracking that.
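To make that gap concrete, here is a minimal Python sketch of how an availability SLO and a judgment SLO can diverge for the same agent. The 99.9% availability target comes from the paragraph above; the request counts, the 99% judgment target, and the names `SloResult` and `evaluate_slo` are illustrative assumptions, not part of OpenSRM.

```python
from dataclasses import dataclass


@dataclass
class SloResult:
    attained: float         # fraction of events that were good
    budget_consumed: float  # share of the error budget used (1.0 = exhausted)
    met: bool


def evaluate_slo(good: int, total: int, target: float) -> SloResult:
    """Compare an observed good/total ratio against an SLO target."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    attained = good / total
    allowed_bad = (1.0 - target) * total
    budget_consumed = (total - good) / allowed_bad
    return SloResult(attained, budget_consumed, attained >= target)


# Hypothetical month for a code review bot: 10,000 requests, 8 failed outright.
availability = evaluate_slo(good=9_992, total=10_000, target=0.999)

# Of the 9,992 reviews it completed, assume humans later reversed 600 approvals
# (about 20 per day). The 99% judgment target is an assumption.
judgment = evaluate_slo(good=9_392, total=9_992, target=0.99)

print(f"availability: met={availability.met}, budget used={availability.budget_consumed:.0%}")
print(f"judgment:     met={judgment.met}, budget used={judgment.budget_consumed:.0%}")
```

With these made-up numbers the bot stays inside its availability budget while overspending its judgment budget several times over, which is the failure mode an availability-only dashboard never shows.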

Meanwhile, AI systems are being deployed into production without the reliability practices that every other critical service takes for granted. No SLO contracts. No dependency math. No deployment gates based on decision quality.
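One reading of "dependency math" is the ceiling that upstream dependencies place on a service's own target. The sketch below assumes hard, serial dependencies that fail independently, so availabilities multiply; the function name and the example figures are hypothetical.

```python
from math import prod


def serial_availability(own: float, dependencies: list[float]) -> float:
    """Upper bound on availability when every dependency is a hard, serial one.

    Assumes failures are independent, so the availabilities multiply.
    """
    values = [own, *dependencies]
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("availabilities must be between 0 and 1")
    return prod(values)


# Hypothetical AI service: 99.95% on its own, calling a model API (99.9%)
# and a vector store (99.9%).
ceiling = serial_availability(0.9995, [0.999, 0.999])

# Can the service honestly promise 99.9% to its callers?
feasible = ceiling >= 0.999
print(f"ceiling={ceiling:.4%} feasible={feasible}")
```

Under these assumptions the ceiling is roughly 99.75%, so a 99.9% promise would be broken by the dependencies alone before the service has made a single mistake of its own.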

I'm building the tooling that connects these two worlds: bringing SRE discipline to AI systems, and extending SRE for the judgment-quality problems that AI introduces.

I wrote about this: Your AI Agent Is Available, Fast, and Making Terrible Decisions (judgment SLOs), OpenSRM: An Open Specification for Service Reliability, and Shift-Left Reliability.


🔧 The OpenSRM Ecosystem

OpenSRM is an open specification for declaring service reliability requirements as code, including judgment SLOs for AI decision quality. The spec is the shared contract that every component reads.

Five independent tools compose through the spec without depending on each other:

NthLayer: Generate your entire monitoring stack from a single YAML manifest. Prometheus rules, Grafana dashboards, PagerDuty configs, deployment gates. Deterministic, no AI required.

Arbiter: Universal quality measurement for AI agent output. Point it at your agents, it tells you which ones are producing good work and which are silently degrading. Tracks per-agent quality trends, self-calibrates through human correction signals, and governs agent autonomy.

SitRep: Situational awareness at enterprise scale. Pre-correlates millions of observability signals continuously so that when something breaks, the correlated picture is already built. Seconds to a situation snapshot, not minutes of manual dashboard correlation.

Mayday: AI-coordinated incident response. Specialised agents handle triage, investigation, communication, and remediation under human supervision. Findings flow back into the ecosystem so the system learns from every incident.

Each tool works alone. Together they form a complete reliability lifecycle: define (OpenSRM) → generate (NthLayer) → measure (Arbiter) → correlate (SitRep) → respond (Mayday) → learn (back to OpenSRM).


πŸ—οΈ GasTown Contributions

The ecosystem's quality measurement concepts were proven inside GasTown (Steve Yegge's multi-agent workspace manager):

  • Guardian: Quality-review layer for the internal merge pipeline, implemented as a Deacon plugin. Scores per-worker output, tracks quality trends, alerts on degradation. Fix-merged to main by Steve Yegge.
  • Feed problems view refactor β€” Replaced tmux scraping with structured beads-based health detection. Merged via dual-model review (5 iteration passes).

🧭 Architecture Principles

Zero Framework Cognition (ZFC): Transport is code. Judgment is model. Code handles deterministic transformation. The model handles interpretation. Every component in the ecosystem follows this boundary.
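One possible reading of that boundary, sketched in Python: the code only formats, parses, and validates, and the decision itself comes from an injected model call. Every name here (`Judge`, `build_prompt`, `parse_verdict`, `review`, `stub_judge`) is hypothetical, and the model is replaced by a stub so the sketch runs on its own.

```python
import json
from typing import Callable

# A judge is any callable that takes a prompt and returns the model's raw reply.
Judge = Callable[[str], str]


def build_prompt(diff: str) -> str:
    """Transport: deterministic formatting, no interpretation of the diff."""
    return (
        "Review this diff. Reply as JSON with keys 'decision' "
        "(APPROVE or REJECT) and 'reason'.\n" + diff
    )


def parse_verdict(raw: str) -> dict:
    """Transport: deterministic parsing and validation of the reply."""
    verdict = json.loads(raw)
    if not isinstance(verdict, dict):
        raise ValueError("reply must be a JSON object")
    if verdict.get("decision") not in {"APPROVE", "REJECT"}:
        raise ValueError(f"unexpected decision: {verdict.get('decision')!r}")
    return verdict


def review(diff: str, judge: Judge) -> dict:
    """Code moves data in and out; the judge alone decides."""
    return parse_verdict(judge(build_prompt(diff)))


def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return json.dumps({"decision": "REJECT", "reason": "hard-coded credential"})


result = review("+ password = 'hunter2'", stub_judge)
print(result["decision"], "-", result["reason"])
```

Note what the code does not do: it never inspects the diff for keywords or scores it with heuristics. Under this reading, any such rule would be judgment leaking into the transport layer.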

The spec is the integration layer: Components don't import each other's code. They all read OpenSRM manifests and emit OTel telemetry following shared semantic conventions. Adopt one tool or all five.

Independence is the feature: Unlike platforms where you must adopt everything to get value, each component solves a complete problem alone.


📫 Connect

Pinned

  1. nthlayer (Public)

    Generate the complete reliability stack from a service spec in 5 minutes. Dashboards, alerts, SLOs, PagerDuty - zero toil.

    Python 17 1

  2. nthlayer-measure (Public)

    Universal quality measurement engine for AI agent output. Part of the OpenSRM ecosystem.

    Python 2

  3. opensrm (Public)

    An open specification for declaring service reliability requirements as code. Define SLOs, dependencies, ownership, and observability in version-controlled YAML.

    HTML 1

  4. nthlayer-correlate (Public)

    Situational awareness through automated signal correlation. Part of the OpenSRM ecosystem.

    Python

  5. nthlayer-learn (Public)

    The atomic unit of AI judgment: structured records for tracking AI decision quality

    Python

  6. nthlayer-respond (Public)

    Multi-agent incident response coordinated by AI. Part of the OpenSRM ecosystem.

    Python