rsionnach/opensrm

OpenSRM

The Open Service Reliability Manifest

A specification and ecosystem for declaring, enforcing, and measuring service reliability as code.


Status: Draft | License: Apache 2.0 | Spec: v1


TL;DR

# service.reliability.yaml
apiVersion: opensrm/v1
kind: ServiceReliabilityManifest
metadata:
  name: payment-api
  team: payments
  tier: critical

spec:
  type: api

  slos:
    availability:
      target: 0.9995
      window: 30d
    latency:
      p99: 200ms
      target: 0.995

  dependencies:
    - service: database
      critical: true
      expects:
        availability: 0.99999

The Problem

Reliability requirements are scattered across wikis, runbooks, Slack threads, and tribal knowledge. Services ship to production without defined SLOs, ownership, or operational readiness. Reliability decisions happen in postmortems instead of before deployment.

We have version control for code (Git), infrastructure (Terraform), and policy (OPA). We're missing version control for reliability.

The Solution

OpenSRM defines reliability requirements in a single manifest that travels with your service:

[Diagram: manifest processing flow, from service.reliability.yaml through validate, enforce, and deploy]

The Ecosystem

OpenSRM is the foundation for a complete operational reliability stack:

[Diagram: OpenSRM ecosystem overview, showing the Specification, NthLayer, nthlayer-correlate, Consumers, and OTel Semantic Conventions]

Components

| Component | Purpose | Status |
|---|---|---|
| Specification | Core manifest schema | Stable |
| ai-gate Extension | AI decision services | Stable |
| Judgment SLOs | Decision quality metrics | Documented |
| GitHub Action | CI/CD validation | Available |
| nthlayer-learn | Data primitive for AI judgments | Implemented |
| NthLayer | Reliability-as-code CLI tool | Alpha |
| nthlayer-measure | Quality measurement + governance | Implemented |
| Change Events | OTel semantic conventions | Drafted |
| Decision Telemetry | OTel semantic conventions | Drafted |
| nthlayer-correlate | Pre-correlation agent | Architecture |
| nthlayer-respond | Multi-agent incident response | Architecture |

See STATUS.md for detailed progress.


Core Features

Contracts & SLOs

OpenSRM separates what you promise externally (contracts) from what you measure internally (SLOs):

spec:
  contract:
    availability: 0.999
    latency:
      p99: 300ms

  slos:
    availability:
      target: 0.9995    # Internal target is tighter
      window: 30d
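
The gap between the contract and the internal SLO can be read as an error-budget margin. The sketch below is illustrative arithmetic only, not part of the specification: `error_budget` is a hypothetical helper, and it assumes availability is measured as the fraction of time the service is up over the window.

```python
from datetime import timedelta

def error_budget(target: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability target over a window."""
    return window * (1.0 - target)

window = timedelta(days=30)
contract_budget = error_budget(0.999, window)   # external promise
internal_budget = error_budget(0.9995, window)  # tighter internal SLO

# Headroom: downtime that breaches the internal SLO
# before the external contract is violated.
headroom = contract_budget - internal_budget
print(contract_budget, internal_budget, headroom)
```

With these numbers the contract tolerates 43.2 minutes of downtime per 30 days and the internal SLO 21.6 minutes, so an internal SLO breach leaves roughly 21.6 minutes of margin before the contract is at risk.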

Dependency-Aware SLO Feasibility

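Declare each dependency along with the guarantees you expect from it. Tools can then calculate your maximum achievable SLO: without mitigation such as caching or fallbacks, a service is generally no more available than the critical dependencies it calls.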

dependencies:
  - service: postgresql
    critical: true
    expects:
      availability: 0.9995
  - service: user-service
    critical: true
    manifest: https://github.com/org/user-service/blob/main/service.reliability.yaml
    expects:
      availability: 0.999
      latency:
        p99: 100ms
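
One way a tool might compute the ceiling is to multiply the expected availability of every critical dependency. This is a minimal sketch under stated assumptions, not the NthLayer algorithm: it assumes critical dependencies sit in series, fail independently, and have no fallback. The function name and the 0.9995 target (borrowed from the internal SLO shown earlier) are illustrative.

```python
from math import prod

def max_achievable_availability(dependencies: list[dict]) -> float:
    """Upper bound on availability given serial, independent critical dependencies."""
    return prod(
        d["expects"]["availability"]
        for d in dependencies
        if d.get("critical")
    )

deps = [
    {"service": "postgresql", "critical": True, "expects": {"availability": 0.9995}},
    {"service": "user-service", "critical": True, "expects": {"availability": 0.999}},
]

ceiling = max_achievable_availability(deps)  # 0.9995 * 0.999
target = 0.9995                              # assumed internal SLO target
feasible = target <= ceiling
print(f"ceiling={ceiling:.7f} feasible={feasible}")
```

Here the ceiling is about 0.9985, so a 0.9995 target would be flagged as infeasible unless the dependencies improve or the service adds a fallback path.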

Ownership & Escalation

Never wonder who owns a service or how to reach them.

ownership:
  team: payments
  slack: "#payments-team"
  escalation: payments-oncall
  pagerduty:
    service_id: PXXXXXX
  runbook: https://wiki.example.com/payment-api-runbook

Observability Requirements

Declare what metrics, dashboards, and alerts must exist.

observability:
  metrics:
    required:
      - http_server_request_duration_seconds
      - http_server_requests_total
  dashboards:
    required: true
  alerts:
    required: true
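
A pre-deployment check over this block could be as simple as a set difference. The sketch below assumes the list of exported metric names has already been fetched from a metrics backend; `missing_metrics` and the sample `exported` set are hypothetical.

```python
def missing_metrics(required: list[str], exported: set[str]) -> list[str]:
    """Return required metric names that the service does not export."""
    return [name for name in required if name not in exported]

required = [
    "http_server_request_duration_seconds",
    "http_server_requests_total",
]
# Assumed result of querying the metrics backend for this service.
exported = {"http_server_requests_total", "process_cpu_seconds_total"}

missing = missing_metrics(required, exported)
if missing:
    print("missing required metrics:", ", ".join(missing))
```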

Deployment Gates

Block deploys based on error budgets, SLO compliance, or recent incidents.

deployment:
  gates:
    error_budget:
      enabled: true
      threshold: 0
    slo_compliance:
      enabled: true
      min_compliance: 0.99
    recent_incidents:
      enabled: true
      lookback: 7d
      max_p1: 0
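
The gate logic can be sketched as three independent comparisons. The semantics here are assumptions rather than the specification's definition: the error budget gate blocks when the remaining budget is at or below `threshold`, the compliance gate blocks below `min_compliance`, and the incident gate blocks when P1 incidents in the lookback exceed `max_p1`. `ServiceState` and `evaluate_gates` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class ServiceState:
    error_budget_remaining: float  # fraction of budget left, 0.0 to 1.0
    slo_compliance: float          # observed compliance over the SLO window
    recent_p1_incidents: int       # P1 incidents inside the lookback period

def evaluate_gates(gates: dict, state: ServiceState) -> list[str]:
    """Return the names of enabled gates that block the deploy."""
    failures = []
    eb = gates.get("error_budget", {})
    if eb.get("enabled") and state.error_budget_remaining <= eb["threshold"]:
        failures.append("error_budget")
    sc = gates.get("slo_compliance", {})
    if sc.get("enabled") and state.slo_compliance < sc["min_compliance"]:
        failures.append("slo_compliance")
    ri = gates.get("recent_incidents", {})
    if ri.get("enabled") and state.recent_p1_incidents > ri["max_p1"]:
        failures.append("recent_incidents")
    return failures

gates = {
    "error_budget": {"enabled": True, "threshold": 0},
    "slo_compliance": {"enabled": True, "min_compliance": 0.99},
    "recent_incidents": {"enabled": True, "lookback": "7d", "max_p1": 0},
}
state = ServiceState(error_budget_remaining=0.4, slo_compliance=0.995,
                     recent_p1_incidents=1)
blocked_by = evaluate_gates(gates, state)
print("deploy blocked by:", blocked_by)
```

Returning the list of failing gates, rather than a single boolean, lets a CI job report every reason a deploy was blocked in one pass.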

Templates for Inheritance

Define standard configurations once and inherit across services:

# Template definition
apiVersion: opensrm/v1
kind: Template
metadata:
  name: api-critical
spec:
  slos:
    availability:
      target: 0.9999

# Service inherits from template
metadata:
  name: checkout-api
  template: api-critical
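
Inheritance of this kind is commonly resolved with a recursive merge in which service values override template values. The sketch below assumes that merge rule; the specification may define different precedence, and `deep_merge` and the sample `service_spec` are hypothetical.

```python
from copy import deepcopy

def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into base recursively; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged

# spec from the api-critical template above
template_spec = {"slos": {"availability": {"target": 0.9999}}}
# Assumed service-level spec that adds a type and a window.
service_spec = {"type": "api", "slos": {"availability": {"window": "30d"}}}

resolved = deep_merge(template_spec, service_spec)
print(resolved)
```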

AI Gate Support

For AI-powered decision systems, OpenSRM supports judgment SLOs that measure decision quality, not just uptime. A layered maturity model supports incremental adoption -- from basic reversal tracking through audit sampling, outcome-based ground truth, and segment-level analysis.

spec:
  type: ai-gate
  slos:
    judgment:
      reversal:
        rate:
          target: 0.05
          window: 30d
        high_confidence_failure:
          target: 0.02
          confidence_threshold: 0.9
      audit:
        enabled: true
        sample_rate: 0.10
        accuracy:
          target: 0.95
          window: 30d
      escalation:
        rate:
          min: 0.05
          max: 0.30

See Judgment SLOs for the full framework.
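
As a rough illustration of how these targets could be evaluated, the sketch below checks aggregate counts against the manifest values. It assumes the reversal target is an upper bound, the audit accuracy target is a lower bound, and the escalation rate must fall inside the `min`/`max` band. `check_judgment_slos` and the sample counts are hypothetical.

```python
def check_judgment_slos(slos: dict, stats: dict) -> dict[str, bool]:
    """Compare observed decision statistics against judgment SLO targets."""
    decisions = stats["decisions"]
    reversal_rate = stats["reversals"] / decisions
    escalation_rate = stats["escalations"] / decisions
    audit_accuracy = stats["audits_correct"] / stats["audits"]
    band = slos["escalation"]["rate"]
    return {
        "reversal_rate": reversal_rate <= slos["reversal"]["rate"]["target"],
        "audit_accuracy": audit_accuracy >= slos["audit"]["accuracy"]["target"],
        "escalation_rate": band["min"] <= escalation_rate <= band["max"],
    }

slos = {
    "reversal": {"rate": {"target": 0.05, "window": "30d"}},
    "audit": {"accuracy": {"target": 0.95, "window": "30d"}},
    "escalation": {"rate": {"min": 0.05, "max": 0.30}},
}
# Made-up counts for one 30-day window.
stats = {"decisions": 1000, "reversals": 40, "escalations": 20,
         "audits": 100, "audits_correct": 96}

results = check_judgment_slos(slos, stats)
print(results)
```

In this sample the escalation check fails because the system escalates only 2% of decisions, below the 5% floor. A lower bound on escalation is what distinguishes a band from a simple ceiling: too few escalations can indicate overconfidence.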


Semantic Conventions

OpenSRM proposes OTel semantic conventions for operational signals:

Change Events

Standardized schema for operational changes (deploys, config, feature flags). These enable correlation between changes and incidents.

# OTel Event
name: change
attributes:
  change.id: chg-deploy-001
  change.type: deployment
  change.scope.service: payment-service
  change.timestamp: "2026-02-20T14:25:00Z"

See conventions/change-events/.
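
To illustrate the correlation these events enable, the sketch below lists changes to a service that landed shortly before an incident began. It uses the attribute names from the example above; the 30-minute lookback, the function name, and the second sample event are assumptions.

```python
from datetime import datetime, timedelta, timezone

def changes_before(incident_start: datetime, changes: list[dict], service: str,
                   lookback: timedelta = timedelta(minutes=30)) -> list[str]:
    """Return IDs of changes to `service` within `lookback` before the incident."""
    candidates = []
    for change in changes:
        # Replace the trailing "Z" so fromisoformat works on older Python versions.
        ts = datetime.fromisoformat(change["change.timestamp"].replace("Z", "+00:00"))
        age = incident_start - ts
        if change["change.scope.service"] == service and timedelta(0) <= age <= lookback:
            candidates.append(change["change.id"])
    return candidates

changes = [
    {"change.id": "chg-deploy-001", "change.type": "deployment",
     "change.scope.service": "payment-service",
     "change.timestamp": "2026-02-20T14:25:00Z"},
    # Hypothetical unrelated change to another service.
    {"change.id": "chg-config-002", "change.type": "config",
     "change.scope.service": "user-service",
     "change.timestamp": "2026-02-20T14:30:00Z"},
]
incident_start = datetime(2026, 2, 20, 14, 40, tzinfo=timezone.utc)

suspects = changes_before(incident_start, changes, "payment-service")
print(suspects)
```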

Decision Telemetry

Standardized schema for AI/human decisions and their outcomes.

# AI makes a decision
gen_ai.decision.id: dec-001
gen_ai.decision.value: approve
gen_ai.decision.confidence: 0.87

# Human overrides
gen_ai.reversal.decision_id: dec-001
gen_ai.reversal.type: human_override
gen_ai.reversal.new_value: request_changes

See conventions/decision-telemetry/.
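
Because a reversal carries the ID of the decision it overrides, the two record types can be joined to derive reversal statistics. The sketch below is one possible join, not a reference implementation; the function name, the 0.9 confidence threshold, and all records other than `dec-001` are made up for illustration.

```python
def reversal_summary(decisions: list[dict], reversals: list[dict],
                     confidence_threshold: float = 0.9) -> dict:
    """Join decisions to reversals by decision ID and summarize."""
    reversed_ids = {r["gen_ai.reversal.decision_id"] for r in reversals}
    reversed_decisions = [
        d for d in decisions if d["gen_ai.decision.id"] in reversed_ids
    ]
    # Reversed decisions the model was highly confident about.
    high_confidence = [
        d for d in reversed_decisions
        if d["gen_ai.decision.confidence"] >= confidence_threshold
    ]
    total = len(decisions)
    return {
        "reversal_rate": len(reversed_decisions) / total if total else 0.0,
        "high_confidence_failures": len(high_confidence),
    }

decisions = [
    {"gen_ai.decision.id": "dec-001", "gen_ai.decision.value": "approve",
     "gen_ai.decision.confidence": 0.87},
    {"gen_ai.decision.id": "dec-002", "gen_ai.decision.value": "approve",
     "gen_ai.decision.confidence": 0.95},
    {"gen_ai.decision.id": "dec-003", "gen_ai.decision.value": "request_changes",
     "gen_ai.decision.confidence": 0.72},
    {"gen_ai.decision.id": "dec-004", "gen_ai.decision.value": "approve",
     "gen_ai.decision.confidence": 0.91},
]
reversals = [
    {"gen_ai.reversal.decision_id": "dec-001",
     "gen_ai.reversal.type": "human_override",
     "gen_ai.reversal.new_value": "request_changes"},
]

summary = reversal_summary(decisions, reversals)
print(summary)
```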


Quick Start

1. Create an OpenSRM manifest

# service.reliability.yaml
apiVersion: opensrm/v1
kind: ServiceReliabilityManifest
metadata:
  name: my-service
  team: my-team
  tier: standard

spec:
  type: api
  slos:
    availability:
      target: 0.999
      window: 30d

2. Validate it

# Using the OpenSRM GitHub Action (recommended)
# See GitHub Action section below

# Or using a JSON Schema validator (the ajv command is provided by the ajv-cli package)
npx ajv-cli validate -s spec/v1/schema.json -d service.reliability.yaml

# Or using NthLayer (reference implementation)
nthlayer validate service.reliability.yaml

3. Enforce it

# Check if the service meets its declared requirements
nthlayer check --manifest service.reliability.yaml

# Gate a deployment
nthlayer check-deploy --manifest service.reliability.yaml --exit-on-failure

GitHub Action

Validate OpenSRM manifests in your CI/CD pipeline with the official GitHub Action.

Basic Usage

- uses: rsionnach/opensrm@v1
  with:
    manifest: 'service.reliability.yaml'

Inputs

| Input | Description | Required | Default |
|---|---|---|---|
| manifest | Path to manifest file (supports glob patterns) | Yes | - |
| schema-version | Schema version to validate against | No | v1 |
| strict | Fail on warnings (missing recommended fields) | No | false |

Outputs

| Output | Description |
|---|---|
| valid | Whether all manifests are valid (true/false) |
| validated-count | Number of manifests validated |
| warnings-count | Number of warnings generated |

Full Workflow Example

name: Validate OpenSRM

on:
  push:
    paths:
      - '**/*.reliability.yaml'
      - '**/service.reliability.yaml'
  pull_request:
    paths:
      - '**/*.reliability.yaml'
      - '**/service.reliability.yaml'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate OpenSRM manifests
        uses: rsionnach/opensrm@v1
        with:
          manifest: '**/*.reliability.yaml'
          strict: 'false'

What You Can Do With OpenSRM

| Use Case | Description |
|---|---|
| Pre-deployment validation | Check that metrics, dashboards, and alerts exist before shipping |
| SLO feasibility checks | Validate that targets are achievable given dependencies |
| Drift detection | Alert when declared and actual reliability diverge |
| Deployment gating | Block releases when error budgets are exhausted |
| Service catalog enrichment | Feed reliability metadata into Backstage, Cortex, etc. |
| Audit & compliance | Prove that services meet reliability standards |

How OpenSRM Is Different

| Traditional Approach | OpenSRM |
|---|---|
| SLOs in wiki pages | SLOs in version-controlled YAML |
| Ownership in tribal knowledge | Ownership declared and discoverable |
| Dependencies undocumented | Dependencies explicit with criticality |
| Observability requirements assumed | Observability requirements enforced |
| "Is this ready?" = opinion | "Is this ready?" = schema validation |

Documentation

| Resource | Description |
|---|---|
| Full Specification | Complete OpenSRM schema reference |
| JSON Schema | For validation tooling |
| Judgment SLOs | AI decision quality framework |
| Examples | Real-world OpenSRM manifest examples |
| Architecture | Ecosystem architecture and data flows |
| Shift-Left Reliability Skill | Claude Code skill for generating manifests |
| Contributing | How to contribute |
| Governance | RFC process for spec changes |

Implementations

Tools that implement OpenSRM:

| Tool | Type | Status |
|---|---|---|
| NthLayer | CI/CD enforcement | Reference implementation |

Full implementations list

Building a tool that implements OpenSRM? Add it to the list.


Design Principles

  1. Schemas + enforcement -- Every component is defined by a specification first. Implementation follows. Define contracts, then validate them.
  2. Shift-left reliability -- Reliability concerns move earlier in the lifecycle. Service manifests define SLOs before deployment. CI/CD gates enforce contracts.
  3. Operator-agnostic -- The stack supports both human and AI operators. nthlayer-correlate snapshots work for dashboards (human) and LLMs (AI). Decision telemetry captures human and AI decisions equally.
  4. Open standards -- Extend existing standards (OTel) rather than invent new ones. Works with Prometheus, Datadog, or any backend.
  5. Reasoning boundary -- Agent capabilities are reserved for components that require interpretation of ambiguous inputs. Deterministic operations (validation, generation, arithmetic) remain as tools. If a component doesn't need to reason, it isn't an agent.

See ARCHITECTURE.md for the full ecosystem architecture.


Relationship to Other Standards

| Standard | Relationship |
|---|---|
| OpenSLO | Complementary -- OpenSRM adds service context around SLO definitions |
| OpenTelemetry | Extends -- Change events and decision telemetry as OTel conventions |
| Kubernetes | Aligned -- Manifest structure follows K8s conventions |
| Backstage | Integrates -- Manifests can populate service catalogs |

Contributing

We welcome contributions to OpenSRM. See CONTRIBUTING.md for guidelines.

Major changes go through the RFC process described in GOVERNANCE.md.


License

Apache License 2.0 -- See LICENSE


Acknowledgments

OpenSRM builds on ideas from:

About

An open specification for declaring service reliability requirements as code. Define SLOs, dependencies, ownership, and observability in version-controlled YAML.
