OpenSRM

The Open Service Reliability Manifest

A specification and ecosystem for declaring, enforcing, and measuring service reliability as code.

TL;DR

# service.reliability.yaml
apiVersion: opensrm/v1
kind: ServiceReliabilityManifest
metadata:
  name: payment-api
  team: payments
  tier: critical

spec:
  type: api

  slos:
    availability:
      target: 0.9995
      window: 30d
    latency:
      p99: 200ms
      target: 0.995

  dependencies:
    - service: database
      critical: true
      expects:
        availability: 0.99999

The Problem

Reliability requirements are scattered across wikis, runbooks, Slack threads, and tribal knowledge. Services ship to production without defined SLOs, ownership, or operational readiness. Reliability decisions happen in postmortems instead of before deployment.

We have version control for code (Git), infrastructure (Terraform), and policy (OPA). We're missing version control for reliability.

The Solution

OpenSRM defines reliability requirements in a single manifest that travels with your service:

[Figure: Manifest processing flow, from service.reliability.yaml through validate, enforce, and deploy]

The Ecosystem

OpenSRM is the foundation for a complete operational reliability stack:

[Figure: OpenSRM ecosystem overview, showing the Specification, NthLayer, nthlayer-correlate, Consumers, and OTel Semantic Conventions]

Components

| Component | Purpose | Status |
| --- | --- | --- |
| Specification | Core manifest schema | Stable |
| ai-gate Extension | AI decision services | Stable |
| Judgment SLOs | Decision quality metrics | Documented |
| GitHub Action | CI/CD validation | Available |
| nthlayer-learn | Data primitive for AI judgments | Implemented |
| NthLayer | Reliability-as-code CLI tool | Alpha |
| nthlayer-measure | Quality measurement + governance | Implemented |
| Change Events | OTel semantic conventions | Drafted |
| Decision Telemetry | OTel semantic conventions | Drafted |
| nthlayer-correlate | Pre-correlation agent | Architecture |
| nthlayer-respond | Multi-agent incident response | Architecture |

See STATUS.md for detailed progress.

Core Features

Contracts & SLOs

OpenSRM separates what you promise externally (contracts) from what you measure internally (SLOs):

spec:
  contract:
    availability: 0.999
    latency:
      p99: 300ms

  slos:
    availability:
      target: 0.9995    # Internal target is tighter
      window: 30d
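
To make the gap between the contract and the internal SLO concrete, the sketch below converts each availability target into the downtime it permits. It is illustrative arithmetic only, not part of the specification: the helper name `allowed_downtime_minutes` is hypothetical, and it assumes the contract is evaluated over the same 30-day window as the SLO, which the manifest above does not state.

```python
def allowed_downtime_minutes(target: float, window_days: int) -> float:
    """Minutes of full unavailability that a target permits over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes


# Assumption: both targets are measured over the same 30-day window.
contract_budget = allowed_downtime_minutes(0.999, 30)   # external promise
slo_budget = allowed_downtime_minutes(0.9995, 30)       # tighter internal target

print(f"contract: {contract_budget:.1f} min, SLO: {slo_budget:.1f} min")
# prints "contract: 43.2 min, SLO: 21.6 min"
```

Under these assumptions the internal SLO alerts the team after roughly half the downtime that would breach the external contract.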

Dependency-Aware SLO Feasibility

Declare each dependency together with the guarantees you expect from it. Tools can then calculate the maximum SLO your service can achieve given those dependencies.

dependencies:
  - service: postgresql
    critical: true
    expects:
      availability: 0.9995
  - service: user-service
    critical: true
    manifest: https://github.com/org/user-service/blob/main/service.reliability.yaml
    expects:
      availability: 0.999
      latency:
        p99: 100ms
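
One way a tool might estimate the achievable ceiling is to multiply the expected availabilities of all critical dependencies. This is a minimal sketch, not the method used by NthLayer or mandated by the spec: it assumes every critical dependency is on the request path and that failures are independent. The function names are hypothetical.

```python
from math import prod


def max_achievable_availability(dependencies: list[dict]) -> float:
    """Upper bound on availability, assuming serial, independent critical dependencies."""
    critical = [
        dep["expects"]["availability"]
        for dep in dependencies
        if dep.get("critical")
    ]
    return prod(critical, start=1.0)


def is_feasible(target: float, dependencies: list[dict]) -> bool:
    """A target above the dependency ceiling cannot be met under this model."""
    return target <= max_achievable_availability(dependencies)


# The dependencies declared in the manifest above.
deps = [
    {"service": "postgresql", "critical": True, "expects": {"availability": 0.9995}},
    {"service": "user-service", "critical": True, "expects": {"availability": 0.999}},
]

ceiling = max_achievable_availability(deps)
print(f"ceiling: {ceiling:.7f}")
print("0.9995 feasible:", is_feasible(0.9995, deps))
```

Under this model the ceiling is about 0.9985, so an availability target of 0.9995 would be flagged as infeasible unless a dependency is made non-critical or its guarantee improves.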

Ownership & Escalation

Never wonder who owns a service or how to reach them.

ownership:
  team: payments
  slack: "#payments-team"
  escalation: payments-oncall
  pagerduty:
    service_id: PXXXXXX
  runbook: https://wiki.example.com/payment-api-runbook

Observability Requirements

Declare what metrics, dashboards, and alerts must exist.

observability:
  metrics:
    required:
      - http_server_request_duration_seconds
      - http_server_requests_total
  dashboards:
    required: true
  alerts:
    required: true
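
A readiness check for the required metrics can be as simple as a set difference. The sketch below is hypothetical: `missing_metrics` is not an OpenSRM or NthLayer function, and the list of exposed metric names is assumed to come from whatever backend you query.

```python
def missing_metrics(required: list[str], exposed: list[str]) -> list[str]:
    """Return the required metric names that the service does not expose."""
    exposed_set = set(exposed)
    return [name for name in required if name not in exposed_set]


required = [
    "http_server_request_duration_seconds",
    "http_server_requests_total",
]
# Assumption: metric names discovered from the service's metrics backend.
exposed = ["http_server_requests_total", "process_cpu_seconds_total"]

gaps = missing_metrics(required, exposed)
print(gaps)  # prints "['http_server_request_duration_seconds']"
```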

Deployment Gates

Block deploys based on error budgets, SLO compliance, or recent incidents.

deployment:
  gates:
    error_budget:
      enabled: true
      threshold: 0
    slo_compliance:
      enabled: true
      min_compliance: 0.99
    recent_incidents:
      enabled: true
      lookback: 7d
      max_p1: 0
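
The gate configuration above could be evaluated along the following lines. This is a sketch under stated assumptions rather than the reference behavior: it assumes `threshold` is the fraction of error budget that must remain (a deploy is blocked at or below it), that `min_compliance` is an inclusive lower bound, and that `max_p1` is the highest tolerated number of P1 incidents in the lookback window. `ServiceState` and `evaluate_gates` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    error_budget_remaining: float   # fraction of the budget left, e.g. 0.4
    slo_compliance: float           # observed compliance, e.g. 0.995
    p1_incidents_in_lookback: int   # P1 incidents within the lookback window


def evaluate_gates(gates: dict, state: ServiceState) -> list[str]:
    """Return the names of enabled gates that block the deploy."""
    failures = []
    budget = gates.get("error_budget", {})
    if budget.get("enabled") and state.error_budget_remaining <= budget["threshold"]:
        failures.append("error_budget")
    compliance = gates.get("slo_compliance", {})
    if compliance.get("enabled") and state.slo_compliance < compliance["min_compliance"]:
        failures.append("slo_compliance")
    incidents = gates.get("recent_incidents", {})
    if incidents.get("enabled") and state.p1_incidents_in_lookback > incidents["max_p1"]:
        failures.append("recent_incidents")
    return failures


gates = {
    "error_budget": {"enabled": True, "threshold": 0},
    "slo_compliance": {"enabled": True, "min_compliance": 0.99},
    "recent_incidents": {"enabled": True, "lookback": "7d", "max_p1": 0},
}

healthy = ServiceState(0.4, 0.995, 0)
degraded = ServiceState(0.0, 0.98, 1)
print(evaluate_gates(gates, healthy))
print(evaluate_gates(gates, degraded))
```

An empty list means the deploy may proceed. Any returned gate name identifies why it was blocked.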

Templates for Inheritance

Define standard configurations once and inherit across services:

# Template definition
apiVersion: opensrm/v1
kind: Template
metadata:
  name: api-critical
spec:
  slos:
    availability:
      target: 0.9999

---
# Service inherits from template
apiVersion: opensrm/v1
kind: ServiceReliabilityManifest
metadata:
  name: checkout-api
  template: api-critical
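
Inheritance can be pictured as a recursive merge in which the service's own values override the template's. The exact merge rules are defined by the specification, not by this sketch. The code below assumes nested maps merge key by key and that scalars and lists are replaced outright. `deep_merge` is a hypothetical helper, and the `type` and `window` values in the service spec are made up for illustration.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into base without mutating either; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Spec from the api-critical template above.
template_spec = {"slos": {"availability": {"target": 0.9999}}}
# Hypothetical service-level additions for checkout-api.
service_spec = {"type": "api", "slos": {"availability": {"window": "30d"}}}

effective = deep_merge(template_spec, service_spec)
print(effective)
```

Here the effective spec keeps the template's availability target and adds the service's own window and type.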

AI Gate Support

For AI-powered decision systems, OpenSRM supports judgment SLOs that measure decision quality, not just uptime. A layered maturity model supports incremental adoption -- from basic reversal tracking through audit sampling, outcome-based ground truth, and segment-level analysis.

spec:
  type: ai-gate
  slos:
    judgment:
      reversal:
        rate:
          target: 0.05
          window: 30d
        high_confidence_failure:
          target: 0.02
          confidence_threshold: 0.9
      audit:
        enabled: true
        sample_rate: 0.10
        accuracy:
          target: 0.95
          window: 30d
      escalation:
        rate:
          min: 0.05
          max: 0.30

See Judgment SLOs for the full framework.
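
As a rough illustration of how the reversal and escalation targets above might be checked, the sketch below computes the three rates from a list of decision records. The record shape (`confidence`, `reversed`, `escalated`) and the function name are assumptions, as is the choice to measure high-confidence failures against high-confidence decisions only. The Judgment SLOs document is the authority on the actual definitions.

```python
def judgment_report(decisions: list[dict], slos: dict) -> dict:
    """Compare observed judgment rates against the targets in the manifest."""
    def ratio(count: int, total: int) -> float:
        return count / total if total else 0.0

    reversal = slos["reversal"]
    threshold = reversal["high_confidence_failure"]["confidence_threshold"]
    high_conf = [d for d in decisions if d["confidence"] >= threshold]

    reversal_rate = ratio(sum(d["reversed"] for d in decisions), len(decisions))
    high_conf_failure = ratio(sum(d["reversed"] for d in high_conf), len(high_conf))
    escalation_rate = ratio(sum(d["escalated"] for d in decisions), len(decisions))
    bounds = slos["escalation"]["rate"]

    return {
        "reversal_ok": reversal_rate <= reversal["rate"]["target"],
        "high_confidence_ok": high_conf_failure <= reversal["high_confidence_failure"]["target"],
        "escalation_ok": bounds["min"] <= escalation_rate <= bounds["max"],
    }


slos = {
    "reversal": {
        "rate": {"target": 0.05},
        "high_confidence_failure": {"target": 0.02, "confidence_threshold": 0.9},
    },
    "escalation": {"rate": {"min": 0.05, "max": 0.30}},
}
# Made-up sample: 10 decisions, 2 reversed (1 at high confidence), 2 escalated.
decisions = (
    [{"confidence": 0.95, "reversed": False, "escalated": False}] * 6
    + [{"confidence": 0.92, "reversed": True, "escalated": False}]
    + [{"confidence": 0.60, "reversed": True, "escalated": False}]
    + [{"confidence": 0.50, "reversed": False, "escalated": True}] * 2
)

report = judgment_report(decisions, slos)
print(report)
```

With this sample data the reversal rate (0.2) and the high-confidence failure rate (1 in 7) both exceed their targets, while the escalation rate (0.2) sits inside the allowed band.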

Semantic Conventions

OpenSRM proposes OTel semantic conventions for operational signals:

Change Events

Standardized schema for operational changes (deploys, config, feature flags). These enable correlation between changes and incidents.

# OTel Event
name: change
attributes:
  change.id: chg-deploy-001
  change.type: deployment
  change.scope.service: payment-service
  change.timestamp: "2026-02-20T14:25:00Z"

See conventions/change-events/.
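
To show why a shared change schema helps with correlation, the sketch below selects the changes to a service that occurred shortly before an incident began. It is a simplified illustration, not how nthlayer-correlate works: the function names, the 30-minute lookback, and the second sample event are assumptions. Only the attribute names come from the example above.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def changes_before(changes: list[dict], service: str, incident_start: str,
                   lookback: timedelta = timedelta(minutes=30)) -> list[dict]:
    """Changes to `service` that happened within `lookback` before the incident."""
    start = parse_ts(incident_start)
    return [
        change for change in changes
        if change["change.scope.service"] == service
        and timedelta(0) <= start - parse_ts(change["change.timestamp"]) <= lookback
    ]


changes = [
    {"change.id": "chg-deploy-001", "change.type": "deployment",
     "change.scope.service": "payment-service",
     "change.timestamp": "2026-02-20T14:25:00Z"},
    # Hypothetical earlier change, outside the lookback window.
    {"change.id": "chg-config-000", "change.type": "config",
     "change.scope.service": "payment-service",
     "change.timestamp": "2026-02-20T12:00:00Z"},
]

suspects = changes_before(changes, "payment-service", "2026-02-20T14:40:00Z")
print([c["change.id"] for c in suspects])  # prints "['chg-deploy-001']"
```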

Decision Telemetry

Standardized schema for AI/human decisions and their outcomes.

# AI makes a decision
gen_ai.decision.id: dec-001
gen_ai.decision.value: approve
gen_ai.decision.confidence: 0.87

# Human overrides
gen_ai.reversal.decision_id: dec-001
gen_ai.reversal.type: human_override
gen_ai.reversal.new_value: request_changes

See conventions/decision-telemetry/.
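
Because a reversal refers back to the original decision by ID, a consumer can join the two event types to derive a reversal rate. The sketch below assumes each event is a flat mapping of the attributes shown above. The function name and the second decision are hypothetical.

```python
def reversal_rate(events: list[dict]) -> float:
    """Share of recorded decisions that were later reversed."""
    decision_ids = {
        e["gen_ai.decision.id"] for e in events if "gen_ai.decision.id" in e
    }
    reversed_ids = {
        e["gen_ai.reversal.decision_id"]
        for e in events
        if "gen_ai.reversal.decision_id" in e
    }
    if not decision_ids:
        return 0.0
    # Only count reversals that point at a decision we actually observed.
    return len(decision_ids & reversed_ids) / len(decision_ids)


events = [
    {"gen_ai.decision.id": "dec-001", "gen_ai.decision.value": "approve",
     "gen_ai.decision.confidence": 0.87},
    # Hypothetical second decision that is never reversed.
    {"gen_ai.decision.id": "dec-002", "gen_ai.decision.value": "approve",
     "gen_ai.decision.confidence": 0.93},
    {"gen_ai.reversal.decision_id": "dec-001",
     "gen_ai.reversal.type": "human_override",
     "gen_ai.reversal.new_value": "request_changes"},
]

rate = reversal_rate(events)
print(rate)  # prints "0.5"
```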

Quick Start

1. Create an OpenSRM manifest

# service.reliability.yaml
apiVersion: opensrm/v1
kind: ServiceReliabilityManifest
metadata:
  name: my-service
  team: my-team
  tier: standard

spec:
  type: api
  slos:
    availability:
      target: 0.999
      window: 30d

2. Validate it

# Using the OpenSRM GitHub Action (recommended)
# See GitHub Action section below

# Or using a JSON Schema validator (the ajv command ships in the ajv-cli package)
npx ajv-cli validate -s spec/v1/schema.json -d service.reliability.yaml

# Or using NthLayer (reference implementation)
nthlayer validate service.reliability.yaml

3. Enforce it

# Check if the service meets its declared requirements
nthlayer check --manifest service.reliability.yaml

# Gate a deployment
nthlayer check-deploy --manifest service.reliability.yaml --exit-on-failure

GitHub Action

Validate OpenSRM manifests in your CI/CD pipeline with the official GitHub Action.

Basic Usage

- uses: rsionnach/opensrm@v1
  with:
    manifest: 'service.reliability.yaml'

Inputs

| Input | Description | Required | Default |
| --- | --- | --- | --- |
| `manifest` | Path to manifest file (supports glob patterns) | Yes | - |
| `schema-version` | Schema version to validate against | No | `v1` |
| `strict` | Fail on warnings (missing recommended fields) | No | `false` |

Outputs

| Output | Description |
| --- | --- |
| `valid` | Whether all manifests are valid (`true`/`false`) |
| `validated-count` | Number of manifests validated |
| `warnings-count` | Number of warnings generated |

Full Workflow Example

name: Validate OpenSRM

on:
  push:
    paths:
      - '**/*.reliability.yaml'
      - '**/service.reliability.yaml'
  pull_request:
    paths:
      - '**/*.reliability.yaml'
      - '**/service.reliability.yaml'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate OpenSRM manifests
        uses: rsionnach/opensrm@v1
        with:
          manifest: '**/*.reliability.yaml'
          strict: 'false'

What You Can Do With OpenSRM

| Use Case | Description |
| --- | --- |
| Pre-deployment validation | Check that metrics, dashboards, and alerts exist before shipping |
| SLO feasibility checks | Validate that targets are achievable given dependencies |
| Drift detection | Alert when declared and actual reliability diverge |
| Deployment gating | Block releases when error budgets are exhausted |
| Service catalog enrichment | Feed reliability metadata into Backstage, Cortex, etc. |
| Audit & compliance | Prove that services meet reliability standards |

How OpenSRM Is Different

| Traditional Approach | OpenSRM |
| --- | --- |
| SLOs in wiki pages | SLOs in version-controlled YAML |
| Ownership in tribal knowledge | Ownership declared and discoverable |
| Dependencies undocumented | Dependencies explicit with criticality |
| Observability requirements assumed | Observability requirements enforced |
| "Is this ready?" = opinion | "Is this ready?" = schema validation |

Documentation

| Resource | Description |
| --- | --- |
| Full Specification | Complete OpenSRM schema reference |
| JSON Schema | For validation tooling |
| Judgment SLOs | AI decision quality framework |
| Examples | Real-world OpenSRM manifest examples |
| Architecture | Ecosystem architecture and data flows |
| Shift-Left Reliability Skill | Claude Code skill for generating manifests |
| Contributing | How to contribute |
| Governance | RFC process for spec changes |

Implementations

Tools that implement OpenSRM:

| Tool | Type | Status |
| --- | --- | --- |
| NthLayer | CI/CD enforcement | Reference implementation |

Full implementations list

Building a tool that implements OpenSRM? Add it to the list.

Design Principles

Schemas + enforcement -- Every component is defined by a specification first. Implementation follows. Define contracts, then validate them.
Shift-left reliability -- Reliability concerns move earlier in the lifecycle. Service manifests define SLOs before deployment. CI/CD gates enforce contracts.
Operator-agnostic -- The stack supports both human and AI operators. nthlayer-correlate snapshots work for dashboards (human) and LLMs (AI). Decision telemetry captures human and AI decisions equally.
Open standards -- Extend existing standards (OTel) rather than invent new ones. Works with Prometheus, Datadog, or any backend.
Reasoning boundary -- Agent capabilities are reserved for components that require interpretation of ambiguous inputs. Deterministic operations (validation, generation, arithmetic) remain as tools. If a component doesn't need to reason, it isn't an agent.

See ARCHITECTURE.md for the full ecosystem architecture.

Relationship to Other Standards

| Standard | Relationship |
| --- | --- |
| OpenSLO | Complementary -- OpenSRM adds service context around SLO definitions |
| OpenTelemetry | Extends -- Change events and decision telemetry as OTel conventions |
| Kubernetes | Aligned -- Manifest structure follows K8s conventions |
| Backstage | Integrates -- Manifests can populate service catalogs |

Contributing

We welcome contributions to OpenSRM. See CONTRIBUTING.md for guidelines.

Major changes go through the RFC process described in GOVERNANCE.md.

License

Apache License 2.0 -- See LICENSE

Acknowledgments

OpenSRM builds on ideas from:

OpenSLO -- SLO specification
OpenTelemetry -- Semantic Conventions
Google SRE Handbook -- SLO/SLI concepts
Backstage -- Service catalog patterns
