Situational awareness through automated signal correlation.
Enterprise-scale distributed systems produce an enormous volume of observability signals: metrics at 15-second intervals across thousands of services, structured logs on every request, distributed traces, alerts from multiple monitoring systems, change events from CI/CD pipelines, feature flag changes, and infrastructure scaling events. Together these amount to millions of events per minute. No human can correlate across all of these signals during an incident by reading dashboards, and no existing tool pre-processes these signals into a form that AI agents (or humans) can consume efficiently.
nthlayer-correlate solves this by continuously pre-correlating signals in the background so that when something goes wrong, the correlated picture is already built. Rather than querying raw events at incident time (which is too slow and too noisy at scale), nthlayer-correlate groups related signals, computes temporal proximity, identifies co-occurring changes, and maintains a rolling window of pre-correlated state. When an incident happens, generating a situational snapshot takes seconds rather than minutes of ad-hoc querying across Prometheus, Loki, Jaeger, and your change history.
Phase 2 Tier 1 is fully implemented. The design documented below reflects the implemented architecture.
Prometheus handles metrics. Loki handles logs. Jaeger handles traces. But correlating across all three plus change events plus quality scores at enterprise scale is an unsolved problem that most teams handle manually during incidents (or don't handle at all). A company running thousands of services (think SaaS platforms like Workday, Stripe, Twilio) needs something that continuously watches across signal types and has the correlated view ready before anyone asks for it.
Agentic systems add additional volume on top of the enterprise baseline. AI agent decisions, quality scores, model version changes, prompt updates, and adapter deployments all produce signals that existing observability infrastructure wasn't designed to handle. nthlayer-correlate exists because raw observability data at enterprise scale is unusable without a pre-correlation layer.
Pre-correlation is the core architectural concept. It is the difference between "let me spend 20 minutes querying four different systems during an incident" and "here's what happened, already correlated, in 3 seconds."
nthlayer-correlate continuously runs in the background, grouping related signals by service, time window, and topology. When a metric anomaly appears near a deployment event for a related service, nthlayer-correlate has already noted the temporal proximity before anyone asks. The pre-correlated data is indexed and ready for snapshot generation at any time.
Pre-correlation itself is transport (deterministic grouping, windowing, counting). Interpreting what the correlations mean is judgment (the model decides whether a temporal correlation is likely causal). This follows Zero Framework Cognition.
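The transport half of that split can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `Signal` shape, the function names, and the 15-minute defaults are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Signal:
    # Hypothetical minimal shape; real signals carry more fields.
    service: str
    kind: str  # e.g. "deploy" or "quality_degradation"
    timestamp: datetime


def bucket_key(signal, window):
    """Map a signal to (service, window index) using integer epoch arithmetic."""
    epoch_seconds = int(signal.timestamp.timestamp())
    return signal.service, epoch_seconds // int(window.total_seconds())


def group_signals(signals, window=timedelta(minutes=15)):
    """Deterministically group signals by service and fixed time window."""
    groups = defaultdict(list)
    for s in signals:
        groups[bucket_key(s, window)].append(s)
    return dict(groups)


def proximity_pairs(signals, max_gap=timedelta(minutes=15)):
    """Return (earlier, later, gap) for every pair no more than max_gap apart."""
    ordered = sorted(signals, key=lambda s: s.timestamp)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            gap = b.timestamp - a.timestamp
            if gap > max_gap:
                break  # sorted by time, so every later signal is further away
            pairs.append((a, b, gap))
    return pairs


t = datetime(2026, 3, 6, 14, 11, tzinfo=timezone.utc)
signals = [
    Signal("webapp", "deploy", t),
    Signal("webapp", "quality_degradation", t + timedelta(minutes=7)),
    Signal("billing", "deploy", t + timedelta(hours=2)),
]
groups = group_signals(signals)
pairs = proximity_pairs(signals)
```

Fixed buckets split signals that straddle a boundary: the 14:11 deploy and the 14:18 degradation land in different 15-minute buckets. That is why the sketch also computes pairwise gaps. Nothing here decides whether the 7-minute gap is causal; that remains the model's judgment.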
A snapshot is a point-in-time document that answers "what's happening right now?" with structured evidence. Every snapshot follows the same schema regardless of how it was triggered:

    snapshot:
      id: sitrep-2026-03-06T14:23:00Z
      triggered_by: alert | schedule | manual
      window: 15m
      severity: info | warning | critical
      summary: "model-generated natural language summary"
      signals:
        - source: arbiter
          type: quality_degradation
          detail: "worker ace-mjxwfy7e rejection rate 0.33 (threshold 0.20)"
          timestamp: 2026-03-06T14:18:00Z
        - source: otel
          type: deploy
          detail: "model version updated on rig-webapp 12m ago"
          timestamp: 2026-03-06T14:11:00Z
      correlations:
        - signals: [0, 1]
          confidence: 0.82
          interpretation: "quality degradation started within 7m of model version change"
      topology:
        affected_services: [webapp, api-gateway]
        dependency_chain: [webapp -> api-gateway -> database]
      recommended_actions:
        - "investigate model version change on rig-webapp"
        - "check if other workers on same rig are affected"

The schema captures what happened (signals), what's related (correlations with confidence scores), what's affected (topology from OpenSRM manifests), and what to do next (recommended actions). The signals and topology sections are transport (structured data from known sources). The summary, correlation interpretation, and recommended actions are judgment (model-generated).
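Because the judgment fields are model-generated, a consumer may want a cheap structural check before trusting a snapshot. The sketch below is one way to do that, under two assumptions that the schema example suggests but does not state: the integers in `correlations[].signals` are zero-based positions in the `signals` list, and `confidence` lies between 0 and 1. The function name `validate_snapshot` is hypothetical.

```python
ALLOWED_TRIGGERS = {"alert", "schedule", "manual"}
ALLOWED_SEVERITIES = {"info", "warning", "critical"}


def validate_snapshot(snapshot):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if snapshot.get("triggered_by") not in ALLOWED_TRIGGERS:
        problems.append("triggered_by must be alert, schedule, or manual")
    if snapshot.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("severity must be info, warning, or critical")

    signals = snapshot.get("signals", [])
    for i, corr in enumerate(snapshot.get("correlations", [])):
        # Assumption: indices are zero-based positions in the signals list.
        for idx in corr.get("signals", []):
            if not 0 <= idx < len(signals):
                problems.append(f"correlation {i} references missing signal {idx}")
        # Assumption: confidence is a probability-like value in [0, 1].
        conf = corr.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            problems.append(f"correlation {i} confidence must be between 0 and 1")
    return problems
```

For example, a correlation that points at signal index 2 when only two signals exist is reported as a problem rather than raising an error, so a partially valid snapshot can still be inspected.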
At enterprise scale, nthlayer-correlate needs a streaming/queuing layer between event producers and the correlation engine. Raw events from OTel collectors, monitoring systems, CI/CD pipelines, change event sources, and quality score producers flow through a message queue that handles backpressure, replay, and fan-out.
- Enterprise scale: Kafka, with partitioning by service and topics by signal type. Kafka's compaction and replay capabilities are designed for exactly this volume, and its consumer group model maps naturally to having multiple ecosystem components (nthlayer-correlate, nthlayer-measure, nthlayer-respond) each consuming the same event stream independently.
- Smaller deployments: NATS provides a lighter-weight alternative for teams that don't need Kafka's full feature set.
nthlayer-correlate consumes from the queue, pre-correlates, and stores the results. This decouples event production rate from correlation processing rate, which is essential when thousands of services are each producing metrics, logs, and traces continuously.
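The "topics by signal type, partitioning by service" layout can be illustrated without any broker client. In the sketch below the topic naming scheme, the event field names, and the use of CRC32 are assumptions for illustration only; a real Kafka or NATS client applies its own key hashing and subject routing.

```python
import zlib


def topic_for(event):
    """One topic per signal type (hypothetical naming scheme)."""
    return f"signals.{event['signal_type']}"


def partition_for(event, num_partitions):
    """Stable partition per service, so one service's events stay ordered.

    zlib.crc32 is used because it gives the same value in every process,
    unlike Python's built-in hash() for strings.
    """
    key = event["service"].encode("utf-8")
    return zlib.crc32(key) % num_partitions


def route(event, num_partitions=12):
    """Return the (topic, partition) pair a producer would publish to."""
    return topic_for(event), partition_for(event, num_partitions)
```

Keying by service means all events for one service land on the same partition, so the correlation engine sees them in order, while different signal types can be consumed and scaled independently.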
nthlayer-correlate generates snapshots in three modes, each producing the same schema but with different urgency and depth:
- Batch (periodic): Lightweight summaries every N minutes (default: 5 minutes in WATCHING state) for continuous situational awareness. These snapshots capture the ambient state of the system.
- Incident-triggered: On alert firing, nthlayer-correlate pulls in more context and performs deeper correlation. These snapshots are richer and more detailed, designed to give an incident responder (human or agent) immediate context.
- Refresh (on-demand): When a human or agent requests an updated picture, nthlayer-correlate generates a fresh snapshot incorporating any new information that arrived since the last one. During active incidents, refresh snapshots run on a 1-minute cycle.
nthlayer-correlate operates in distinct states that affect its behaviour:
| State | Trigger | Behaviour |
|---|---|---|
| WATCHING | Normal operations | Background correlation, 5-minute snapshot cycle |
| ALERT | Elevated signal detected | Increased correlation frequency, broader signal ingestion |
| INCIDENT | Incident declared | Continuous reassessment, 1-minute snapshots, pushes context to nthlayer-respond |
| DEGRADED | Own judgment SLO metrics below threshold | Conservative mode, reduced confidence in correlations, flags for human review |
The DEGRADED state is important: nthlayer-correlate monitors its own quality and reduces confidence when it detects its correlations are less reliable. This is self-awareness as a feature, not an afterthought.
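A compact way to read the table is as a small state selector. The sketch below is illustrative only: the precedence order (DEGRADED over INCIDENT over ALERT over WATCHING) and the function names are assumptions, and snapshot intervals are listed only for the two states where the table gives a number.

```python
from datetime import timedelta
from enum import Enum


class State(Enum):
    WATCHING = "watching"
    ALERT = "alert"
    INCIDENT = "incident"
    DEGRADED = "degraded"


# Only the intervals stated in the table; other states are left unspecified.
SNAPSHOT_INTERVAL = {
    State.WATCHING: timedelta(minutes=5),
    State.INCIDENT: timedelta(minutes=1),
}


def next_state(*, slo_healthy, incident_declared, elevated_signal):
    """Pick the operating state from current conditions.

    Assumed precedence: a failing judgment SLO wins over everything else,
    because correlations produced while degraded should not be trusted.
    """
    if not slo_healthy:
        return State.DEGRADED
    if incident_declared:
        return State.INCIDENT
    if elevated_signal:
        return State.ALERT
    return State.WATCHING


def snapshot_interval(state):
    """Return the snapshot cycle for a state, or None if the table gives none."""
    return SNAPSHOT_INTERVAL.get(state)
```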
When quality degrades (signalled by nthlayer-measure), nthlayer-correlate looks for recent changes that temporally correlate with the degradation. It consumes changes via the standardised change event schema defined in the OpenSRM spec, which means all change sources (deploys, config updates, model version swaps, prompt changes, adapter deployments, formula revisions) arrive in a uniform format:

    change_event:
      id: chg-2026-03-06-001
      timestamp: "2026-03-06T14:11:00Z"
      type: model_version
      scope:
        service: webapp
        environment: production
        rig: rig-webapp
      source: model-registry
      actor: deploy-pipeline
      detail:
        from_version: "claude-sonnet-4-20250514"
        to_version: "claude-sonnet-4-20250715"
      risk: low
      rollback_available: true

nthlayer-correlate doesn't need per-source integrations because the change event schema normalises everything. The pre-correlation layer continuously maintains a rolling window of changes, so when a quality signal fires, the candidate causes are already indexed. Identifying the candidate set is transport (pre-computed by the correlation engine). Evaluating whether a temporal correlation is causal is judgment (the model decides).
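The rolling window of changes might look something like the sketch below. The class name `ChangeWindow`, the one-hour horizon, and the 15-minute lookback are assumptions for illustration; the only fields read from each event are `timestamp` and `scope.service` from the change event schema above.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    # fromisoformat() accepts "Z" only from Python 3.11, so normalise it first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


class ChangeWindow:
    """Keeps recent change events so candidate causes are ready on demand."""

    def __init__(self, horizon=timedelta(hours=1)):
        self.horizon = horizon
        self._changes = deque()  # change_event dicts, oldest first

    def add(self, change):
        """Append a change; assumes events arrive in timestamp order."""
        self._changes.append(change)
        newest = parse_ts(change["timestamp"])
        # Evict anything older than the horizon relative to the newest event.
        while newest - parse_ts(self._changes[0]["timestamp"]) > self.horizon:
            self._changes.popleft()

    def candidates(self, service, fired_at, lookback=timedelta(minutes=15)):
        """Changes on `service` in the lookback period before `fired_at`."""
        return [
            c for c in self._changes
            if c["scope"]["service"] == service
            and timedelta(0) <= fired_at - parse_ts(c["timestamp"]) <= lookback
        ]


window = ChangeWindow()
window.add({
    "id": "chg-2026-03-06-001",
    "timestamp": "2026-03-06T14:11:00Z",
    "type": "model_version",
    "scope": {"service": "webapp", "environment": "production"},
})
fired = datetime(2026, 3, 6, 14, 18, tzinfo=timezone.utc)
found = window.candidates("webapp", fired)
```

The lookup returns candidates only. Deciding whether any of them actually caused the degradation is left to the model, in line with the transport and judgment split.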
nthlayer-correlate consumes signals from multiple source types:
- OTel metrics and traces via OTel Collector (Prometheus remote write, OTLP)
- Alerts from Alertmanager (webhook)
- Change events from all sources, normalised via the OpenSRM change event schema (GitHub, ArgoCD, LaunchDarkly, model registries, prompt management systems)
- Quality scores from nthlayer-measure (OTel metrics)
- Deployment records from CI/CD pipelines
nthlayer-correlate reads service topology from OpenSRM manifests to understand dependency relationships when correlating signals. A quality drop in service A that depends on service B (as declared in the manifest) triggers nthlayer-correlate to check service B's signals automatically. The manifest provides the dependency graph that makes topology-aware correlation possible.
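As a rough sketch of topology-aware correlation, the dependency graph can be walked breadth-first to list which services' signals to inspect. The plain dictionary standing in for manifest data and the function name are assumptions, as is the choice to follow dependencies transitively; the text above only states that a direct dependency is checked.

```python
from collections import deque


def services_to_check(dependencies, start):
    """Return `start` followed by everything it depends on, nearest first.

    `dependencies` maps a service to the services it declares as dependencies
    (a stand-in for what would be read from OpenSRM manifests).
    """
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dep in dependencies.get(current, []):
            if dep not in seen:  # guards against cycles in the declared graph
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


deps = {"webapp": ["api-gateway"], "api-gateway": ["database"]}
to_check = services_to_check(deps, "webapp")
# to_check == ["webapp", "api-gateway", "database"]
```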
nthlayer-correlate has its own judgment SLOs, measured through nthlayer-measure's governance framework:
- Correlation accuracy: What percentage of nthlayer-correlate's 'related change' assessments do humans agree with?
- False positive rate: How often does nthlayer-correlate flag a change as incident-related when it isn't?
Every correlation assessment emits a gen_ai.decision.* OTel event, and human disagreements emit gen_ai.override.* events that feed nthlayer-correlate's own quality measurement. If nthlayer-correlate's correlation quality drops, nthlayer-measure's governance layer can reduce nthlayer-correlate's confidence levels or flag it for human review.
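One possible reading of the two SLOs, expressed as code, is shown below. The `Assessment` record and the choice of denominators are assumptions: accuracy is taken over all human-reviewed assessments, and the false positive rate is taken over changes that humans judged unrelated. The project may define them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    # Hypothetical record joining a decision event with its human review.
    flagged_related: bool      # what nthlayer-correlate decided
    human_says_related: bool   # what the reviewer concluded


def correlation_accuracy(assessments):
    """Share of reviewed assessments where the human agreed, or None if empty."""
    if not assessments:
        return None
    agreed = sum(a.flagged_related == a.human_says_related for a in assessments)
    return agreed / len(assessments)


def false_positive_rate(assessments):
    """Share of truly unrelated changes that were flagged anyway."""
    negatives = [a for a in assessments if not a.human_says_related]
    if not negatives:
        return None
    return sum(a.flagged_related for a in negatives) / len(negatives)


reviewed = [
    Assessment(True, True),
    Assessment(True, True),
    Assessment(True, False),   # flagged, but the human disagreed
    Assessment(False, False),
]
accuracy = correlation_accuracy(reviewed)   # 3 of 4 agree
fpr = false_positive_rate(reviewed)         # 1 of 2 unrelated changes flagged
```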
nthlayer-correlate is one component in the OpenSRM ecosystem. Each component solves a complete problem independently, and they compose when used together through shared OpenSRM manifests and OTel telemetry conventions.

                     ┌─────────────────────────┐
                     │    OpenSRM Manifest     │
                     │  (the shared contract)  │
                     └────────────┬────────────┘
                                  │
                     reads        │        reads
             ┌─────────────┬──────┴──────┬─────────────┐
             ▼             ▼             ▼             ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
        │ MEASURE  │  │ NthLayer │  │>CORRELATE│  │ RESPOND  │
        │          │  │          │  │          │  │          │
        │ quality  │  │ generate │  │correlate │  │ incident │
        │+govern   │  │monitoring│  │ signals  │  │ response │
        │+cost     │  │          │  │          │  │          │
        └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
             │             │             │             │
             └─────────────┴──────┬──────┴─────────────┘
                                  ▼
                     ┌──────────────────────────┐
                     │      Verdict Store       │
                     │ (shared data substrate)  │
                     │ create · resolve · link  │
                     │ accuracy · gaming-check  │
                     └────────────┬─────────────┘
                                  │ OTel side-effects
                                  ▼
                     ┌──────────────────────────┐
                     │     OTel Collector /     │
                     │   Prometheus / Grafana   │
                     └──────────────────────────┘

    Learning loop (post-incident):
      nthlayer-respond findings → manifest updates
      → NthLayer regenerates → nthlayer-measure
      refines → nthlayer-correlate improves → OpenSRM

How nthlayer-correlate fits in:
- nthlayer-correlate emits a correlation verdict for every correlation assessment, stored in the shared Verdict Store. nthlayer-respond consumes these verdicts (with confidence scores and lineage) as the starting context for incident response, so there is no direct coupling between the two components.
- nthlayer-correlate consumes nthlayer-measure quality verdicts as events and correlates them with other signals (deployments, config changes, model version swaps) to identify what caused quality degradation.
- nthlayer-correlate reads service topology from OpenSRM manifests (via NthLayer's topology export) to understand dependency relationships when correlating.
- nthlayer-correlate's correlation accuracy improves over time as the learning loop feeds post-incident findings back into its models.
Each component works alone. Someone who just needs signal correlation adopts nthlayer-correlate without needing nthlayer-measure, NthLayer, or nthlayer-respond.
| Component | What it does | Link |
|---|---|---|
| OpenSRM | Specification for declaring service reliability requirements | OpenSRM |
| nthlayer-learn | Data primitive for recording AI judgments and measuring correctness | nthlayer-learn |
| nthlayer-measure | Quality measurement and governance for AI agents | nthlayer-measure |
| NthLayer | Generate monitoring infrastructure from manifests | nthlayer |
| nthlayer-correlate | Situational awareness through signal correlation (this repo) | nthlayer-correlate |
| nthlayer-respond | Multi-agent incident response | nthlayer-respond |
nthlayer-correlate follows Zero Framework Cognition. The boundary is clear:
Transport (code): Ingesting events from the streaming layer, grouping signals by service and time window, maintaining the rolling pre-correlation index, computing temporal proximity between signals, generating the structured snapshot schema, publishing snapshots via API and SSE.
Judgment (model): Interpreting what correlations mean, assessing whether a temporal correlation is likely causal, generating the natural language summary, recommending actions, deciding the snapshot severity level.
Phase 2 Tier 1 of nthlayer-correlate is fully implemented. The design documented here reflects the implemented architecture. The pre-correlation concept has been validated in the existing OpenSRM ecosystem design (see the nthlayer-correlate technical appendix in the OpenSRM repo).
See CONTRIBUTING.md for guidelines.
Apache License 2.0. See LICENSE.