The Open Service Reliability Manifest
A specification and ecosystem for declaring, enforcing, and measuring service reliability as code.
```yaml
# service.reliability.yaml
apiVersion: opensrm/v1
kind: ServiceReliabilityManifest
metadata:
  name: payment-api
  team: payments
  tier: critical
spec:
  type: api
  slos:
    availability:
      target: 0.9995
      window: 30d
    latency:
      p99: 200ms
      target: 0.995
  dependencies:
    - service: database
      critical: true
      expects:
        availability: 0.99999
```

Reliability requirements are scattered across wikis, runbooks, Slack threads, and tribal knowledge. Services ship to production without defined SLOs, ownership, or operational readiness. Reliability decisions happen in postmortems instead of before deployment.
We have version control for code (Git), infrastructure (Terraform), and policy (OPA). We're missing version control for reliability.
OpenSRM defines reliability requirements in a single manifest that travels with your service.
OpenSRM is the foundation for a complete operational reliability stack:
| Component | Purpose | Status |
|---|---|---|
| Specification | Core manifest schema | Stable |
| ai-gate Extension | AI decision services | Stable |
| Judgment SLOs | Decision quality metrics | Documented |
| GitHub Action | CI/CD validation | Available |
| nthlayer-learn | Data primitive for AI judgments | Implemented |
| NthLayer | Reliability-as-code CLI tool | Alpha |
| nthlayer-measure | Quality measurement + governance | Implemented |
| Change Events | OTel semantic conventions | Drafted |
| Decision Telemetry | OTel semantic conventions | Drafted |
| nthlayer-correlate | Pre-correlation agent | Architecture |
| nthlayer-respond | Multi-agent incident response | Architecture |
See STATUS.md for detailed progress.
OpenSRM separates what you promise externally (contracts) from what you measure internally (SLOs):
```yaml
spec:
  contract:
    availability: 0.999
    latency:
      p99: 300ms
  slos:
    availability:
      target: 0.9995  # Internal target is tighter
      window: 30d
```

Declare dependencies with expected guarantees. Tools can calculate your maximum achievable SLO.
```yaml
dependencies:
  - service: postgresql
    critical: true
    expects:
      availability: 0.9995
  - service: user-service
    critical: true
    manifest: https://github.com/org/user-service/blob/main/service.reliability.yaml
    expects:
      availability: 0.999
      latency:
        p99: 100ms
```

Never wonder who owns a service or how to reach them.
```yaml
ownership:
  team: payments
  slack: "#payments-team"
  escalation: payments-oncall
  pagerduty:
    service_id: PXXXXXX
  runbook: https://wiki.example.com/payment-api-runbook
```

Declare what metrics, dashboards, and alerts must exist.
```yaml
observability:
  metrics:
    required:
      - http_server_request_duration_seconds
      - http_server_requests_total
  dashboards:
    required: true
  alerts:
    required: true
```

Block deploys based on error budgets, SLO compliance, or recent incidents.
```yaml
deployment:
  gates:
    error_budget:
      enabled: true
      threshold: 0
    slo_compliance:
      enabled: true
      min_compliance: 0.99
    recent_incidents:
      enabled: true
      lookback: 7d
      max_p1: 0
```

Define standard configurations once and inherit across services:
```yaml
# Template definition
apiVersion: opensrm/v1
kind: Template
metadata:
  name: api-critical
spec:
  slos:
    availability:
      target: 0.9999
---
# Service inherits from template
metadata:
  name: checkout-api
  template: api-critical
```

For AI-powered decision systems, OpenSRM supports judgment SLOs that measure decision quality, not just uptime. A layered maturity model supports incremental adoption -- from basic reversal tracking through audit sampling, outcome-based ground truth, and segment-level analysis.
```yaml
spec:
  type: ai-gate
  slos:
    judgment:
      reversal:
        rate:
          target: 0.05
          window: 30d
        high_confidence_failure:
          target: 0.02
          confidence_threshold: 0.9
      audit:
        enabled: true
        sample_rate: 0.10
        accuracy:
          target: 0.95
          window: 30d
      escalation:
        rate:
          min: 0.05
          max: 0.30
```

See Judgment SLOs for the full framework.
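To make the judgment targets above concrete, here is a minimal Python sketch that computes the three rates from a window of decisions and compares them with the example thresholds. The `Decision` record, the function names, and the choice of denominators (all decisions for reversal and escalation, high-confidence decisions only for `high_confidence_failure`) are assumptions made for illustration; the Judgment SLOs document is the authority on the actual definitions.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical record of one AI decision inside the SLO window."""
    confidence: float
    was_reversed: bool  # later overturned, e.g. by a human
    escalated: bool     # handed to a human instead of decided


def judgment_rates(decisions, confidence_threshold=0.9):
    """Compute the rates under the assumed definitions described above."""
    total = len(decisions)
    if total == 0:
        raise ValueError("no decisions in window")
    high = [d for d in decisions if d.confidence >= confidence_threshold]
    return {
        "reversal_rate": sum(d.was_reversed for d in decisions) / total,
        "high_confidence_failure": (
            sum(d.was_reversed for d in high) / len(high) if high else 0.0
        ),
        "escalation_rate": sum(d.escalated for d in decisions) / total,
    }


def meets_targets(rates):
    """Compare against the example targets from the manifest fragment."""
    return {
        "reversal": rates["reversal_rate"] <= 0.05,
        "high_confidence_failure": rates["high_confidence_failure"] <= 0.02,
        "escalation": 0.05 <= rates["escalation_rate"] <= 0.30,
    }


# Illustrative window of 100 decisions (made-up data).
decisions = (
    [Decision(0.95, True, False)] * 2
    + [Decision(0.60, True, False)] * 2
    + [Decision(0.95, False, False)] * 48
    + [Decision(0.50, False, True)] * 10
    + [Decision(0.70, False, False)] * 38
)
rates = judgment_rates(decisions)
results = meets_targets(rates)
print(rates)
print(results)
```

In this made-up window the overall reversal rate is within target, but reversals concentrated among high-confidence decisions breach the tighter `high_confidence_failure` target, which is the kind of signal a plain reversal rate would hide.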
OpenSRM proposes OTel semantic conventions for operational signals:
Change events: a standardized schema for operational changes (deploys, config changes, feature flags) that enables correlation between changes and incidents.
```yaml
# OTel Event
name: change
attributes:
  change.id: chg-deploy-001
  change.type: deployment
  change.scope.service: payment-service
  change.timestamp: "2026-02-20T14:25:00Z"
```

See conventions/change-events/.
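As a rough illustration of how change events could support change-to-incident correlation, the Python sketch below lists changes to a service that landed shortly before an incident began. The attribute names come from the example above; the `changes_before` helper, the 30-minute lookback, the incident timestamp, and the second and third events are hypothetical, and this is not a description of how nthlayer-correlate works.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an RFC 3339 timestamp; older Pythons need 'Z' rewritten as an offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def changes_before(incident_start, service, changes, lookback=timedelta(minutes=30)):
    """Return changes to `service` made within `lookback` before the incident."""
    start = parse_ts(incident_start)
    return [
        c for c in changes
        if c["change.scope.service"] == service
        and timedelta(0) <= start - parse_ts(c["change.timestamp"]) <= lookback
    ]


changes = [
    {   # the event from the example above
        "change.id": "chg-deploy-001",
        "change.type": "deployment",
        "change.scope.service": "payment-service",
        "change.timestamp": "2026-02-20T14:25:00Z",
    },
    {   # hypothetical: same service, but well outside the lookback
        "change.id": "chg-config-002",
        "change.type": "config",
        "change.scope.service": "payment-service",
        "change.timestamp": "2026-02-20T12:00:00Z",
    },
    {   # hypothetical: recent, but a different service
        "change.id": "chg-deploy-003",
        "change.type": "deployment",
        "change.scope.service": "user-service",
        "change.timestamp": "2026-02-20T14:30:00Z",
    },
]

suspects = changes_before("2026-02-20T14:40:00Z", "payment-service", changes)
suspect_ids = [c["change.id"] for c in suspects]
print(suspect_ids)
```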
Decision telemetry: a standardized schema for AI and human decisions and their outcomes.
```yaml
# AI makes a decision
gen_ai.decision.id: dec-001
gen_ai.decision.value: approve
gen_ai.decision.confidence: 0.87

# Human overrides
gen_ai.reversal.decision_id: dec-001
gen_ai.reversal.type: human_override
gen_ai.reversal.new_value: request_changes
```

See conventions/decision-telemetry/.
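Because a reversal references its decision by ID, the two event types can be joined after the fact. The Python sketch below shows one way that join might look, using the attribute names from the example; the `attach_reversals` helper, the flat dictionary representation, and the `dec-002` event are assumptions for illustration only.

```python
def attach_reversals(events):
    """Index decisions by ID, then mark the ones a reversal event points at."""
    decisions = {}
    for event in events:
        if "gen_ai.decision.id" in event:
            decisions[event["gen_ai.decision.id"]] = {
                **event,
                "reversed": False,
                "final_value": event.get("gen_ai.decision.value"),
            }
    for event in events:
        ref = event.get("gen_ai.reversal.decision_id")
        if ref in decisions:
            decisions[ref]["reversed"] = True
            decisions[ref]["final_value"] = event["gen_ai.reversal.new_value"]
    return decisions


events = [
    {
        "gen_ai.decision.id": "dec-001",
        "gen_ai.decision.value": "approve",
        "gen_ai.decision.confidence": 0.87,
    },
    {   # hypothetical second decision that is never overridden
        "gen_ai.decision.id": "dec-002",
        "gen_ai.decision.value": "approve",
        "gen_ai.decision.confidence": 0.95,
    },
    {
        "gen_ai.reversal.decision_id": "dec-001",
        "gen_ai.reversal.type": "human_override",
        "gen_ai.reversal.new_value": "request_changes",
    },
]

joined = attach_reversals(events)
for decision_id, record in joined.items():
    print(decision_id, record["reversed"], record["final_value"])
```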
```yaml
# service.reliability.yaml
apiVersion: opensrm/v1
kind: ServiceReliabilityManifest
metadata:
  name: my-service
  team: my-team
  tier: standard
spec:
  type: api
  slos:
    availability:
      target: 0.999
      window: 30d
```

```bash
# Using the OpenSRM GitHub Action (recommended)
# See GitHub Action section below

# Or using a JSON Schema validator
npx ajv validate -s spec/v1/schema.json -d service.reliability.yaml

# Or using NthLayer (reference implementation)
nthlayer validate service.reliability.yaml
```

```bash
# Check if the service meets its declared requirements
nthlayer check --manifest service.reliability.yaml

# Gate a deployment
nthlayer check-deploy --manifest service.reliability.yaml --exit-on-failure
```

Validate OpenSRM manifests in your CI/CD pipeline with the official GitHub Action.
```yaml
- uses: rsionnach/opensrm@v1
  with:
    manifest: 'service.reliability.yaml'
```

| Input | Description | Required | Default |
|---|---|---|---|
| `manifest` | Path to manifest file (supports glob patterns) | Yes | - |
| `schema-version` | Schema version to validate against | No | v1 |
| `strict` | Fail on warnings (missing recommended fields) | No | false |

| Output | Description |
|---|---|
| `valid` | Whether all manifests are valid (true/false) |
| `validated-count` | Number of manifests validated |
| `warnings-count` | Number of warnings generated |
```yaml
name: Validate OpenSRM
on:
  push:
    paths:
      - '**/*.reliability.yaml'
      - '**/service.reliability.yaml'
  pull_request:
    paths:
      - '**/*.reliability.yaml'
      - '**/service.reliability.yaml'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate OpenSRM manifests
        uses: rsionnach/opensrm@v1
        with:
          manifest: '**/*.reliability.yaml'
          strict: 'false'
```

| Use Case | Description |
|---|---|
| Pre-deployment validation | Check that metrics, dashboards, and alerts exist before shipping |
| SLO feasibility checks | Validate that targets are achievable given dependencies |
| Drift detection | Alert when declared vs. actual reliability diverges |
| Deployment gating | Block releases when error budgets are exhausted |
| Service catalog enrichment | Feed reliability metadata into Backstage, Cortex, etc. |
| Audit & compliance | Prove that services meet reliability standards |
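The SLO feasibility check listed above can be sketched in a few lines of Python. This is a simplified model, not the reference implementation: it assumes every dependency is critical and that dependency failures are independent, so the expected availabilities multiply into a ceiling the service cannot exceed. The function names and example values are illustrative.

```python
from math import prod


def max_achievable_availability(expected):
    """Ceiling on availability when every listed dependency is critical.

    Assumes independent failures, so availabilities multiply.
    """
    return prod(expected)


def is_feasible(target, expected):
    """True if the declared target does not exceed the dependency ceiling."""
    return target <= max_achievable_availability(expected)


# expects.availability for two critical dependencies (illustrative values)
deps = [0.9995, 0.999]
ceiling = max_achievable_availability(deps)
feasible = is_feasible(0.9995, deps)
print(f"ceiling={ceiling:.5f} feasible={feasible}")
```

Under these assumptions a 0.9995 availability target sits above the ceiling implied by the two dependencies, so a tool could flag it before the service ships. Real dependencies are rarely fully independent, and caching or fallbacks can raise the effective ceiling, so treat the product as a first approximation.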
| Traditional Approach | OpenSRM |
|---|---|
| SLOs in wiki pages | SLOs in version-controlled YAML |
| Ownership in tribal knowledge | Ownership declared and discoverable |
| Dependencies undocumented | Dependencies explicit with criticality |
| Observability requirements assumed | Observability requirements enforced |
| "Is this ready?" = opinion | "Is this ready?" = schema validation |
| Resource | Description |
|---|---|
| Full Specification | Complete OpenSRM schema reference |
| JSON Schema | For validation tooling |
| Judgment SLOs | AI decision quality framework |
| Examples | Real-world OpenSRM manifest examples |
| Architecture | Ecosystem architecture and data flows |
| Shift-Left Reliability Skill | Claude Code skill for generating manifests |
| Contributing | How to contribute |
| Governance | RFC process for spec changes |
Tools that implement OpenSRM:
| Tool | Type | Status |
|---|---|---|
| NthLayer | CI/CD enforcement | Reference implementation |
Building a tool that implements OpenSRM? Add it to the list.
- Schemas + enforcement -- Every component is defined by a specification first. Implementation follows. Define contracts, then validate them.
- Shift-left reliability -- Reliability concerns move earlier in the lifecycle. Service manifests define SLOs before deployment. CI/CD gates enforce contracts.
- Operator-agnostic -- The stack supports both human and AI operators. nthlayer-correlate snapshots work for dashboards (human) and LLMs (AI). Decision telemetry captures human and AI decisions equally.
- Open standards -- Extend existing standards (OTel) rather than invent new ones. Works with Prometheus, Datadog, or any backend.
- Reasoning boundary -- Agent capabilities are reserved for components that require interpretation of ambiguous inputs. Deterministic operations (validation, generation, arithmetic) remain as tools. If a component doesn't need to reason, it isn't an agent.
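As one illustration of the shift-left principle above, a CI/CD gate can be reduced to deterministic arithmetic on the error budget. The Python sketch below is a hypothetical model rather than NthLayer's implementation: `GateInput` and both function names are made up, and it assumes a `threshold` of 0 means "block once no budget remains" and that `max_p1` caps P1 incidents in the lookback period.

```python
from dataclasses import dataclass


@dataclass
class GateInput:
    """Hypothetical inputs a deployment gate would need."""
    slo_target: float      # e.g. 0.9995
    window_minutes: float  # e.g. 30 days = 43,200 minutes
    bad_minutes: float     # observed unavailability inside the window
    p1_incidents: int      # P1 incidents inside the lookback period


def remaining_budget_fraction(g):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = (1.0 - g.slo_target) * g.window_minutes
    if budget <= 0:
        return 0.0  # a 100% target leaves no budget to spend
    return (budget - g.bad_minutes) / budget


def deploy_allowed(g, budget_threshold=0.0, max_p1=0):
    """Allow the deploy only if budget remains and recent P1s are within the cap."""
    return remaining_budget_fraction(g) > budget_threshold and g.p1_incidents <= max_p1


window = 30 * 24 * 60  # 30d in minutes
healthy = GateInput(0.9995, window, bad_minutes=10.0, p1_incidents=0)
exhausted = GateInput(0.9995, window, bad_minutes=25.0, p1_incidents=0)
recent_p1 = GateInput(0.9995, window, bad_minutes=10.0, p1_incidents=1)

for name, gate in [("healthy", healthy), ("exhausted", exhausted), ("recent_p1", recent_p1)]:
    print(name, deploy_allowed(gate))
```

A 0.9995 target over 30 days leaves roughly 21.6 minutes of budget, so 10 bad minutes passes the gate while 25 does not. Keeping this as plain arithmetic also matches the reasoning-boundary principle: nothing here needs an agent.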
See ARCHITECTURE.md for the full ecosystem architecture.
| Standard | Relationship |
|---|---|
| OpenSLO | Complementary -- OpenSRM adds service context around SLO definitions |
| OpenTelemetry | Extends -- Change events and decision telemetry as OTel conventions |
| Kubernetes | Aligned -- Manifest structure follows K8s conventions |
| Backstage | Integrates -- Manifests can populate service catalogs |
We welcome contributions to OpenSRM. See CONTRIBUTING.md for guidelines.
Major changes go through the RFC process described in GOVERNANCE.md.
Apache License 2.0 -- See LICENSE
OpenSRM builds on ideas from:
- OpenSLO -- SLO specification
- OpenTelemetry -- Semantic Conventions
- Google SRE Handbook -- SLO/SLI concepts
- Backstage -- Service catalog patterns