Multi-agent incident response coordinated by AI.
Incident response involves a lot of repetitive work: classifying severity, identifying blast radius, correlating changes with symptoms, drafting stakeholder updates, deciding whether to roll back, and communicating resolution. Most of this work follows patterns that AI agents can handle reliably, freeing human responders for the judgment calls that actually need them (novel failure modes, business-critical tradeoffs, cross-team coordination).
nthlayer-respond is an incident response system where specialised AI agents collaborate to triage, investigate, communicate, and remediate under human supervision. Each agent has a clear domain, defined decision authority, and its own judgment SLO that measures how often humans need to correct its work. nthlayer-respond owns the incident lifecycle, and uses tools like PagerDuty, Slack, and email as notification channels rather than treating them as upstream incident sources.
Phase 3 is fully implemented: all agents (triage, investigation, communication, remediation), coordinator, CLI, and 8 scenario fixtures are complete, with 168 passing tests.
Alert Source (nthlayer-measure quality breach / Prometheus alert / any webhook)
        │
        ▼
nthlayer-correlate Snapshot (correlated context)
        │
        ▼
nthlayer-respond Orchestrator (creates incident context)
        │
        ▼
Agent Pipeline (triage → investigate + communicate → remediate)
        │
        ▼
Notification Channels (PagerDuty / Slack / email / status page)
nthlayer-respond receives alerts from any source (nthlayer-measure's quality breach signals, Prometheus alerting rules, or any webhook), requests a correlated snapshot from nthlayer-correlate for context, then coordinates the response through its agent pipeline. PagerDuty, Slack, email, and status pages are notification channels that nthlayer-respond uses to reach humans when it needs approval or escalation.
nthlayer-respond uses a purpose-built orchestrator (not a general-purpose agent framework) that sequences agents based on the incident lifecycle. The orchestrator itself is not an agent. It's a deterministic state machine that sequences agent execution (transport). Agents reason within their step (judgment). This follows Zero Framework Cognition.
            ┌──────────────┐
            │    Triage    │  severity, blast radius, initial assignment
            └──────┬───────┘
                   │
       ┌───────────┴───────────┐
       ▼                       ▼
┌──────────────┐        ┌──────────────┐
│Investigation │        │Communication │  initial stakeholder notification
└──────┬───────┘        └──────┬───────┘
       │                       │
       │  root cause found     │
       ├───────────────────────┤
       ▼                       ▼
┌──────────────┐        ┌──────────────┐
│ Remediation  │        │Communication │  update with root cause + fix
└──────────────┘        └──────────────┘
Triage runs first, then Investigation and Communication run in parallel. When Investigation produces a root cause, Remediation begins and Communication sends an update with the findings and fix.
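The sequencing described above can be sketched with `asyncio`. This is a minimal illustration, not the project's actual orchestrator: the `Phase` enum, the agent stub functions, and the context keys are all hypothetical names chosen for the example.

```python
import asyncio
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names for illustration only.
    TRIAGE = "triage"
    INVESTIGATE_AND_COMMUNICATE = "investigate_and_communicate"
    REMEDIATE_AND_UPDATE = "remediate_and_update"
    DONE = "done"


# Hypothetical agent stubs: real agents would call a model and external tools.
async def triage(ctx: dict) -> None:
    ctx["triage"] = {"severity": "P1"}


async def investigate(ctx: dict) -> None:
    ctx["investigation"] = {"root_cause": "H1"}


async def communicate(ctx: dict, kind: str) -> None:
    comms = ctx.setdefault("communication", {"updates_sent": []})
    comms["updates_sent"].append(kind)


async def remediate(ctx: dict) -> None:
    ctx["remediation"] = {"proposed_action": "rollback_model_version"}


async def run_pipeline(ctx: dict) -> list[Phase]:
    """Sequence agents in a fixed order; the orchestrator itself makes no judgment calls."""
    visited = [Phase.TRIAGE]
    await triage(ctx)

    # Investigation and the initial notification run concurrently.
    visited.append(Phase.INVESTIGATE_AND_COMMUNICATE)
    await asyncio.gather(investigate(ctx), communicate(ctx, "initial_notification"))

    # Remediation and the follow-up update start only once a root cause exists.
    if ctx["investigation"].get("root_cause"):
        visited.append(Phase.REMEDIATE_AND_UPDATE)
        await asyncio.gather(remediate(ctx), communicate(ctx, "root_cause_update"))

    visited.append(Phase.DONE)
    return visited


ctx: dict = {"id": "INC-2026-0142"}
phases = asyncio.run(run_pipeline(ctx))
```

The transitions are plain control flow, so the same incident always passes through the same phases in the same order; only the agent stubs would contain model calls.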
All nthlayer-respond agents read from and write to a shared incident context object that accumulates findings throughout the incident lifecycle. This is the single accumulating record of what is known about the incident:
incident_context:
  id: INC-2026-0142
  declared_at: "2026-02-23T14:32:00Z"
  source: arbiter_quality_breach
  triage:
    severity: P1
    blast_radius: [checkout-service, payment-gateway]
    affected_slos: [checkout-availability, payment-latency-p99]
    assigned_teams: [platform-checkout, platform-payments]
  investigation:
    hypotheses:
      - id: H1
        description: "model version update at 14:28 introduced quality regression"
        confidence: 0.82
        evidence: [sitrep-correlation-id-847, arbiter-quality-drop]
      - id: H2
        description: "database connection pool exhaustion"
        confidence: 0.34
        evidence: [log-pattern-conn-timeout]
    root_cause: H1
  communication:
    updates_sent:
      - channel: "#platform-incidents"
        timestamp: "2026-02-23T14:33:12Z"
        type: initial_notification
      - channel: status_page
        timestamp: "2026-02-23T14:35:00Z"
        type: investigating
  remediation:
    proposed_action: rollback_model_version
    target: rig-webapp
    risk_assessment: low
    requires_human_approval: false
    executed_at: null
Each agent has a defined domain, specific decision authority, and its own judgment SLO that measures how reliably it performs its role.
The Triage Agent classifies severity based on blast radius and SLO impact from OpenSRM manifests. It determines which services are affected, which teams own them, and how urgent the response needs to be.
- Can: Set severity, notify teams (via PagerDuty/Slack as notification channels), assign ownership.
- Cannot: Remediate. Override an existing classification without human approval.
- Judgment SLO: Reversal rate on severity classifications (target less than 10%).
The Investigation Agent generates hypotheses from nthlayer-correlate snapshots, gathers evidence from metrics, logs, and change history, and ranks root causes by confidence. It adapts its investigation strategy to what the evidence reveals, following the data rather than a fixed checklist.
- Can: Form and rank hypotheses. Declare root cause when confidence exceeds threshold.
- Cannot: Execute any remediation.
- Judgment SLO: Root cause agreement with post-incident review (target 70% at maturity).
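As a rough sketch of the "declare root cause when confidence exceeds threshold" rule, the function below picks the highest-confidence hypothesis and declares it only if it clears a threshold. The `Hypothesis` dataclass and `declare_root_cause` are hypothetical names, and the threshold is a caller-supplied assumption; the source does not state the actual value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    id: str
    description: str
    confidence: float


def declare_root_cause(hypotheses: list[Hypothesis], threshold: float) -> Optional[str]:
    """Return the id of the top-ranked hypothesis if it exceeds the threshold, else None."""
    if not hypotheses:
        return None
    best = max(hypotheses, key=lambda h: h.confidence)
    return best.id if best.confidence > threshold else None


# Confidence values taken from the incident context example.
hypotheses = [
    Hypothesis("H1", "model version update introduced quality regression", 0.82),
    Hypothesis("H2", "database connection pool exhaustion", 0.34),
]
```

With an assumed threshold of 0.8, `declare_root_cause(hypotheses, 0.8)` selects `H1`; with 0.9 it returns `None` and the investigation would continue.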
The Communication Agent produces audience-appropriate messaging. It selects communication channels based on severity and stakeholder type, and decides timing (too frequent is noise, too infrequent loses trust).
- Can: Draft and send updates within pre-approved templates. Choose channels and timing.
- Cannot: Contradict investigation findings. Communicate resolution until confirmed.
- Judgment SLO: Human edit rate on outgoing communications (target less than 15%).
The Remediation Agent selects and executes fixes based on investigation findings, manifest-defined safe actions, and risk assessment.
- Can: Suggest fixes to humans. Execute pre-approved safe actions (rollback, scale up, disable feature flag) without human approval.
- Cannot: Execute novel remediation not pre-approved in the OpenSRM manifest. Make changes to services outside the blast radius.
- Judgment SLO: Fix success rate (target 80%).
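The four judgment SLOs above are all simple rates compared against a target. The sketch below shows one way to evaluate them; `JudgmentSLO`, `rate`, and the metric key names are hypothetical, and only the target values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgmentSLO:
    metric: str
    target: float
    higher_is_better: bool

    def is_met(self, observed: float) -> bool:
        # "Less than" targets are strict; success-style targets are inclusive.
        if self.higher_is_better:
            return observed >= self.target
        return observed < self.target


def rate(events: int, total: int) -> float:
    """Fraction of decisions that had the given outcome; 0.0 when there is no data."""
    return events / total if total else 0.0


# Targets as stated for each agent; metric names are illustrative.
SLOS = {
    "triage": JudgmentSLO("severity_reversal_rate", 0.10, higher_is_better=False),
    "investigation": JudgmentSLO("root_cause_agreement", 0.70, higher_is_better=True),
    "communication": JudgmentSLO("human_edit_rate", 0.15, higher_is_better=False),
    "remediation": JudgmentSLO("fix_success_rate", 0.80, higher_is_better=True),
}
```

For example, 3 reversed severity calls out of 40 gives a reversal rate of 0.075, which meets the triage target, while 5 out of 40 (0.125) does not.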
Agents never take destructive action without human approval unless the action is pre-approved as safe in the OpenSRM manifest. The manifest defines which actions are considered safe for automated execution (like rolling back to a known-good version or scaling up), and everything else requires a human to approve.
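A minimal sketch of that gate, assuming the manifest exposes safe actions as a flat list under a `safe_actions` key. That key, the action identifiers, and the `gate` function are assumptions for illustration; the real OpenSRM manifest schema may differ.

```python
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    NEEDS_HUMAN_APPROVAL = "needs_human_approval"


# Hypothetical manifest fragment; the real schema is defined by OpenSRM.
MANIFEST = {"safe_actions": ["rollback", "scale_up", "disable_feature_flag"]}


def gate(action: str, manifest: dict) -> Decision:
    """Allow automated execution only for actions the manifest lists as safe."""
    safe = set(manifest.get("safe_actions", []))
    if action in safe:
        return Decision.AUTO_EXECUTE
    # Anything not explicitly listed defaults to requiring a human.
    return Decision.NEEDS_HUMAN_APPROVAL
```

The default branch is the important design choice: a missing or empty safe-action list results in every action requiring approval, so a manifest error fails closed.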
Humans make severity calls, approve novel remediations, and override agent decisions. Every agent decision emits OTel telemetry, and every human override feeds back into that agent's judgment SLO. nthlayer-measure's governance layer monitors these judgment SLOs and adjusts agent autonomy accordingly, using the one-way safety ratchet (nthlayer-measure can reduce agent autonomy but cannot increase it without human approval).
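The one-way ratchet can be illustrated as follows. The two autonomy levels and the `AutonomyRatchet` class are hypothetical; the sketch only captures the stated rule that reductions apply automatically while increases need human approval.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    # Hypothetical levels, ordered from least to most autonomous.
    SUGGEST_ONLY = 0
    EXECUTE_SAFE_ACTIONS = 1


class AutonomyRatchet:
    def __init__(self, level: Autonomy) -> None:
        self.level = level

    def request(self, new_level: Autonomy, human_approved: bool = False) -> Autonomy:
        """Apply reductions unconditionally; apply increases only with human approval."""
        if new_level < self.level or human_approved:
            self.level = new_level
        return self.level


ratchet = AutonomyRatchet(Autonomy.EXECUTE_SAFE_ACTIONS)
after_breach = ratchet.request(Autonomy.SUGGEST_ONLY)             # reduction: applied
after_auto_raise = ratchet.request(Autonomy.EXECUTE_SAFE_ACTIONS)  # increase without approval: ignored
after_approved = ratchet.request(Autonomy.EXECUTE_SAFE_ACTIONS, human_approved=True)
```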
After resolution, nthlayer-respond produces structured findings that flow back into the ecosystem rather than sitting in a document that nobody reads again:
- Manifest updates: Findings map to specific OpenSRM manifest changes (tighter SLO targets that were too loose, new dependency declarations that were missing, new safe action definitions for remediation)
- Rule refinements: Quality patterns inform NthLayer's generated alerting rules (alerts that should have fired earlier or didn't fire at all)
- Threshold revisions: nthlayer-measure's historical data informs whether judgment SLO thresholds need adjustment
- Correlation improvements: nthlayer-correlate's accuracy on past incidents calibrates its future correlations
This closes the learning loop so the system improves after every incident rather than just documenting what happened.
nthlayer-respond reads from OpenSRM manifests extensively during incident response:
- Severity tiers and SLO targets determine how urgently a degradation should be treated
- Safe action definitions in the manifest specify which remediation actions the Remediation Agent can execute without human approval
- Dependency topology tells the Investigation Agent which services to check when a dependency is affected
- Escalation paths and ownership metadata tell the Communication Agent which teams to notify and how to reach them
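To illustrate the dependency-topology point, the sketch below walks a dependency map in reverse to find which services to check when one dependency is affected. The map format, the `services_to_check` function, and the `inventory-service` and `fraud-scoring` names are hypothetical; the actual topology representation in OpenSRM manifests is not shown in this document.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "checkout-service": ["payment-gateway", "inventory-service"],
    "payment-gateway": ["fraud-scoring"],
    "inventory-service": [],
    "fraud-scoring": [],
}


def services_to_check(affected: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return services that directly or transitively depend on `affected`, nearest first."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents: dict[str, list[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = {affected}
    queue = deque([affected])
    order: list[str] = []
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                order.append(service)
                queue.append(service)
    return order
```

With this map, a problem in `fraud-scoring` yields `payment-gateway` first and then `checkout-service`, since breadth-first order puts direct dependents ahead of transitive ones.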
nthlayer-respond consumes signals from, and produces signals for, the other ecosystem components:
- nthlayer-correlate provides correlated snapshots as the starting context for every incident, so nthlayer-respond's agents begin with a correlated picture rather than raw signals
- nthlayer-measure provides quality scores that inform whether AI agents in the response are producing reliable diagnostics, and its governance layer adjusts nthlayer-respond's agent autonomy based on measured performance
- NthLayer provides topology exports and deployment gate status, and consumes nthlayer-respond's post-incident findings to refine generated alerting rules
nthlayer-respond is one component in the OpenSRM ecosystem. Each component solves a complete problem independently, and they compose when used together through shared OpenSRM manifests and OTel telemetry conventions.
             ┌─────────────────────────┐
             │    OpenSRM Manifest     │
             │  (the shared contract)  │
             └────────────┬────────────┘
                          │
                  reads   │   reads
     ┌─────────────┬──────┴──────┬─────────────┐
     ▼             ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│ MEASURE  │  │ NthLayer │  │CORRELATE │  │>RESPOND< │
│          │  │          │  │          │  │          │
│ quality  │  │ generate │  │correlate │  │ incident │
│+govern   │  │monitoring│  │ signals  │  │ response │
│+cost     │  │          │  │          │  │          │
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │             │
     └─────────────┴──────┬──────┴─────────────┘
                          ▼
             ┌─────────────────────────┐
             │      Verdict Store      │
             │ (shared data substrate) │
             │ create · resolve · link │
             │ accuracy · gaming-check │
             └────────────┬────────────┘
                          │ OTel side-effects
                          ▼
             ┌─────────────────────────┐
             │    OTel Collector /     │
             │  Prometheus / Grafana   │
             └─────────────────────────┘
Learning loop (post-incident):
  nthlayer-respond findings → manifest updates
    → NthLayer regenerates → nthlayer-measure
    refines → nthlayer-correlate improves → OpenSRM
How nthlayer-respond fits in:
- nthlayer-respond consumes nthlayer-correlate's correlation verdicts (with confidence scores and lineage) as starting context for every incident, so agents begin with a correlated picture rather than raw signals
- Each nthlayer-respond agent emits its own verdicts (triage, investigation, communication, remediation) linked via lineage to the nthlayer-correlate verdicts that informed them; one human override at any point calibrates every component upstream
- nthlayer-measure monitors nthlayer-respond's agent judgment SLOs and adjusts autonomy via the one-way safety ratchet
- NthLayer provides topology exports and deployment gate status, and consumes post-incident findings as rule refinements
Each component works alone. Someone who just needs incident response coordination adopts nthlayer-respond without needing NthLayer, nthlayer-measure, or nthlayer-correlate (though nthlayer-correlate's correlated verdicts significantly enrich nthlayer-respond's context).
| Component | What it does | Link |
|---|---|---|
| OpenSRM | Specification for declaring service reliability requirements | OpenSRM |
| nthlayer-learn | Data primitive for recording AI judgments and measuring correctness | nthlayer-learn |
| nthlayer-measure | Quality measurement and governance for AI agents | nthlayer-measure |
| NthLayer | Generate monitoring infrastructure from manifests | nthlayer |
| nthlayer-correlate | Situational awareness through signal correlation | nthlayer-correlate |
| nthlayer-respond | Multi-agent incident response (this repo) | nthlayer-respond |
nthlayer-respond follows Zero Framework Cognition. The orchestrator is pure transport: it receives the incident trigger, creates the shared context, sequences agent execution through the pipeline, and routes messages. The agents provide judgment: triaging severity, forming hypotheses, drafting communications, and assessing remediation risk. If the model is unavailable, the orchestrator still creates the incident context and routes it to human operators, degrading to "no AI opinion" rather than "no incident response."
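The degradation behaviour can be sketched as a wrapper around each agent step. `ModelUnavailable`, `run_step`, and the `needs_human` key are hypothetical names; the sketch only shows the idea that a failed model call leaves the incident context intact and flags the step for human operators.

```python
from typing import Callable


class ModelUnavailable(Exception):
    """Hypothetical error raised when the backing model cannot be reached."""


def run_step(ctx: dict, name: str, agent: Callable[[dict], dict]) -> dict:
    """Run one agent step; on model failure, record 'no AI opinion' and flag for humans."""
    try:
        ctx[name] = agent(ctx)
    except ModelUnavailable:
        ctx[name] = None  # no AI opinion for this step
        ctx.setdefault("needs_human", []).append(name)
    return ctx


def offline_triage(ctx: dict) -> dict:
    # Simulates an agent whose model endpoint is down.
    raise ModelUnavailable("model endpoint unreachable")


ctx = run_step({"id": "INC-2026-0142"}, "triage", offline_triage)
```

Because the orchestrator owns context creation and routing, the incident still exists and still reaches a human even when every agent step fails this way.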
Phase 3 is fully implemented. All agents (triage, investigation, communication, remediation), the coordinator, CLI, and 8 scenario fixtures are complete. The test suite has 168 passing tests. See the nthlayer-respond architecture in the OpenSRM repo for the original design specification.
See CONTRIBUTING.md for guidelines.
Apache License 2.0. See LICENSE.