This repository was archived by the owner on May 26, 2026. It is now read-only.

feat(kora): KR-ALERTS-PANEL-FLIP — aggregate real alerts from existing sources#145

Merged
rafe-walker merged 1 commit into feature/phase2-upgrades from feat/kora-KR-ALERTS-PANEL-FLIP on May 23, 2026

@rafe-walker

Owner

Summary

Flips `GET /api/alerts/current` from the v1 stub (#134) to a real aggregator pulling from 5 data sources. Backend-only — Alert TS interface unchanged in `web/src/lib/api.ts`.

Bucket spec: `17_cc_bucket_prompts/KR-ALERTS-PANEL-FLIP_real_aggregation.md`

After this lands, every operator-facing panel reads real data. Last major stub-flip in the cockpit.

Source PRs cited

§1 K-DG drift caught + corrected

The bucket spec's §1 listed two symbols that don't exist verbatim; verified via grep at HEAD `054f4086`:

| Spec | Actual |
| --- | --- |
| `get_operational_state_holder()` | `agent.operational_state_holder.get_holder()` |
| `get_health_holder()` | `agent.health_rollup_holder.get_health_rollup_holder()` |
| `active_rung()` method | ✓ confirmed; not a `@property` |
| `holder.current` `@property` | confirmed (PR #112) |
| `HealthRollup` bare field names | ✓ confirmed (frozen dataclass, no `@property`) |

Documented in the aggregator module docstring under the `K-DG K-DG` heading.

Rule taxonomy

| Rule | Source | Severity | Trigger |
| --- | --- | --- | --- |
| `cost_ladder_warned` | cost holder | warning | rung == WARN_75 |
| `cost_ladder_downshifted` | cost holder | warning | rung == DOWNSHIFT_90 |
| `cost_ladder_halted` | cost holder | critical | rung == HARD_STOP_100 |
| `operator_paused` | op state | critical | primary_state == PAUSED |
| `operator_stopped` | op state | critical | primary_state == STOPPED |
| `webhook_dead_letters_24h` | audit JSONL | warning | count > 5 |
| `capability_denied_24h` | audit JSONL | info | count > 10 (forward-compat) |
| `reasoning_errors_24h` | audit JSONL | warning | execution_error count > 5 |
| `slack_dm_reply_failed_24h` | audit JSONL | warning | count > 3 |
| `service_unhealthy` | probe snapshots | warning per service, critical if unhealthy | snapshot.status in {degraded, unhealthy} |

Thresholds are PROPOSED per bucket §2(b) and live as module-level constants for easy tuning. `service_unhealthy` emits one alert PER affected service (per spec).
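A minimal sketch of how the state-driven rows of the table could map to alerts. The `Rung` enum, the three-field `Alert` dataclass, the `service_unhealthy:<service>` id format, and the reading that `degraded` maps to warning while `unhealthy` maps to critical are all assumptions for illustration, not the actual aggregator code.

```python
from dataclasses import dataclass
from enum import Enum


class Rung(Enum):
    """Hypothetical stand-in for the cost-ladder rung enum."""
    OK = "OK"
    WARN_75 = "WARN_75"
    DOWNSHIFT_90 = "DOWNSHIFT_90"
    HARD_STOP_100 = "HARD_STOP_100"


@dataclass(frozen=True)
class Alert:
    """Simplified alert; the real Alert dataclass has more fields."""
    id: str
    severity: str
    title: str


# rung -> (rule id, severity), copied from the taxonomy table.
COST_RULES = {
    Rung.WARN_75: ("cost_ladder_warned", "warning"),
    Rung.DOWNSHIFT_90: ("cost_ladder_downshifted", "warning"),
    Rung.HARD_STOP_100: ("cost_ladder_halted", "critical"),
}

# Assumed reading of "warning per service, critical if unhealthy".
SERVICE_SEVERITY = {"degraded": "warning", "unhealthy": "critical"}


def cost_ladder_alerts(rung: Rung) -> list[Alert]:
    """Return at most one alert for the active cost-ladder rung."""
    rule = COST_RULES.get(rung)
    if rule is None:
        return []
    rule_id, severity = rule
    return [Alert(id=rule_id, severity=severity, title=f"Cost ladder at {rung.value}")]


def service_alerts(statuses: dict[str, str]) -> list[Alert]:
    """Return one alert PER affected service; healthy services emit nothing."""
    alerts = []
    for service, status in sorted(statuses.items()):
        severity = SERVICE_SEVERITY.get(status)
        if severity is not None:
            alerts.append(
                Alert(
                    id=f"service_unhealthy:{service}",  # hypothetical id format
                    severity=severity,
                    title=f"{service} is {status}",
                )
            )
    return alerts
```

Keeping the mappings in module-level dictionaries mirrors the "module-level constants for easy tuning" approach: changing a severity or adding a rung is a one-line edit.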

Forward-compat: `capability_denied_24h`

The audit emit at `mcp_tools.py:714` is reached AFTER the cap-gate at `mcp.py:181`, so denial responses are NOT currently audit-logged. Rule's matcher uses `details.result == "capability_denied"` so when a follow-on bucket adds audit-on-denial the rule activates automatically; in the meantime it emits zero alerts. Test `test_capability_denied_today_no_alert_since_audit_doesnt_emit_denials` pins this expected behavior.
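To illustrate why the rule stays dormant until denials are audited, here is a rough sketch of a 24-hour count matcher keyed on `details.result`. The entry shape (a `ts` datetime plus a `details` dict) and the function names are assumptions; the real rule reads entries through `read_audit_entries`.

```python
from datetime import datetime, timedelta, timezone

# Proposed threshold from the rule taxonomy (count > 10).
CAPABILITY_DENIED_THRESHOLD = 10


def count_capability_denied(entries, now):
    """Count entries from the last 24h whose details.result is capability_denied."""
    since = now - timedelta(hours=24)
    count = 0
    for entry in entries:
        ts = entry.get("ts")
        details = entry.get("details") or {}
        if ts is not None and ts >= since and details.get("result") == "capability_denied":
            count += 1
    return count


def capability_denied_fires(entries, now):
    """The rule fires only when the count strictly exceeds the threshold."""
    return count_capability_denied(entries, now) > CAPABILITY_DENIED_THRESHOLD
```

With no denial entries in the log the count is zero and the rule emits nothing. Once a producer starts writing `result == "capability_denied"` entries, the same matcher begins to fire without any change to the rule.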

Fail-soft contract

Each per-source helper wraps its access in try/except + returns `[]` on any failure. Outer `compute_active_alerts` catches helper-bypass exceptions too (defense in depth). Operator NEVER sees 500 from the endpoint.

Tests prove this with 4 fail-soft scenarios: cost holder raises, operational holder None, audit reader raises, probe snapshots accessor raises — each test confirms OTHER rules still emit + the aggregator never propagates.
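The two layers of the contract can be sketched as follows. This is an illustration under assumptions, not the shipped code: the `fail_soft` decorator, the helper-list argument, and the logger name are hypothetical, and the real `compute_active_alerts` takes no such argument.

```python
import logging

logger = logging.getLogger("alerts_sketch")  # hypothetical logger name


def fail_soft(helper):
    """Layer 1: wrap a per-source helper so any failure yields an empty list."""
    def wrapped():
        try:
            return list(helper())
        except Exception:
            logger.warning("alert source %s failed", helper.__name__, exc_info=True)
            return []
    return wrapped


def compute_active_alerts(helpers):
    """Layer 2: the outer loop also catches anything a helper lets through."""
    alerts = []
    for helper in helpers:
        try:
            alerts.extend(helper())
        except Exception:
            logger.warning("alert helper bypassed its own guard", exc_info=True)
    return alerts


@fail_soft
def cost_alerts():
    raise RuntimeError("cost holder raised")


@fail_soft
def operator_alerts():
    return ["operator_paused"]


def unguarded_alerts():
    # Deliberately NOT wrapped, to exercise the outer guard.
    raise ValueError("helper without its own try/except")
```

Calling `compute_active_alerts([cost_alerts, operator_alerts, unguarded_alerts])` returns only the alerts from the healthy source, so one broken source never hides the others and never surfaces as an exception at the endpoint.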

Surface

| Layer | LOC | Notes |
| --- | --- | --- |
| `kora_cli/alerts/__init__.py` (NEW) | 16 | public surface re-export |
| `kora_cli/alerts/aggregator.py` (NEW) | 390 | Alert dataclass + 7 rule helpers + compute_active_alerts |
| `kora_cli/web_server.py` | -92 / +46 | stub removed, live aggregator call added |
| `tests/kora_cli/alerts/__init__.py` (NEW) | | empty package marker |
| `tests/kora_cli/alerts/test_aggregator.py` (NEW) | 605 | 28 tests (per-rule trigger + severity + fail-soft + sort + endpoint integration + walk-payload security) |
| `tests/kora_cli/test_web_server_alerts.py` | | updated stub-shape tests + applied CC#2 #137 3-namespace fixture-isolation + baseline-source reset |

3-layer SECURITY contract preserved (from PR #134)

  1. title/detail rendered as plain text: the FE source pins from PR #134 are preserved (AlertsPanel.tsx contains no dangerouslySetInnerHTML)
  2. Walk-payload sweep — no PII (email/Slack-ID) + no token shapes (Anthropic sk-ant-, Slack xox*-, long-hex, Bearer headers). New aggregator test exercises this against diverse triggered alerts.
  3. TS interface enforcement — Alert.to_dict keys exactly match `web/src/lib/api.ts` Alert interface (test pins).
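Layer 2 can be pictured as a recursive walk over the JSON payload that checks every string against forbidden shapes. The regexes below are illustrative guesses at those shapes, and `walk_strings` / `find_leaks` are hypothetical names; the patterns pinned by the real tests may differ.

```python
import re

# Illustrative patterns only; the real test suite may use different ones.
FORBIDDEN = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    # Slack-style user id: U or W, 8+ uppercase alphanumerics, at least one digit.
    "slack_id": re.compile(r"\b[UW](?=[A-Z0-9]*\d)[A-Z0-9]{8,}\b"),
    "anthropic_key": re.compile(r"sk-ant-"),
    "slack_token": re.compile(r"xox[a-z]-"),
    "long_hex": re.compile(r"\b[0-9a-fA-F]{32,}\b"),
    "bearer": re.compile(r"Bearer\s+\S+"),
}


def walk_strings(node):
    """Yield every string found anywhere in a nested dict/list payload."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from walk_strings(key)
            yield from walk_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from walk_strings(item)


def find_leaks(payload):
    """Return the sorted names of all forbidden shapes present in the payload."""
    return sorted(
        {
            name
            for text in walk_strings(payload)
            for name, pattern in FORBIDDEN.items()
            if pattern.search(text)
        }
    )
```

A test in this style would trigger a diverse set of alerts, serialize the endpoint response, and assert that `find_leaks(response)` is empty.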

CC#2 #137 fixture-isolation applied

Both test files use the 3-namespace `get_kora_home` monkeypatch (kora_constants + kora_cli.config + kora_cli.web_server) + reset all 5 aggregator sources to a baseline no-alert state. Prevents pytest-xdist parallel workers from seeing each other's source-state.
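One plausible reason three namespaces need patching is that a `from module import name` import binds a separate reference in each importing module, so patching only the defining module leaves the others pointing at the original function. The sketch below demonstrates that effect with stand-in modules. The module objects, paths, and `patch_kora_home` helper are hypothetical, and whether the real modules import `get_kora_home` this way is an assumption.

```python
import types

# Stand-ins for kora_constants, kora_cli.config and kora_cli.web_server.
constants = types.ModuleType("constants_sketch")
constants.get_kora_home = lambda: "/real/kora-home"

# Equivalent to `from constants_sketch import get_kora_home` in each module:
# every module holds its own reference to the same function object.
config = types.ModuleType("config_sketch")
config.get_kora_home = constants.get_kora_home

web_server = types.ModuleType("web_server_sketch")
web_server.get_kora_home = constants.get_kora_home

NAMESPACES = (constants, config, web_server)


def patch_kora_home(setattr_fn, home):
    """Point get_kora_home at `home` in every namespace that holds a reference.

    In a pytest fixture, pass `monkeypatch.setattr` as `setattr_fn` so the
    patches are undone automatically at teardown.
    """
    for module in NAMESPACES:
        setattr_fn(module, "get_kora_home", lambda: home)
```

Giving each test (and so each pytest-xdist worker) its own temporary home through all three names keeps file-backed state, such as the audit JSONL, from leaking between workers.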

Test plan

  • 28 aggregator tests pass
  • 15 preserved security/shape tests in test_web_server_alerts.py pass
  • 123/123 cross-bucket regression (alerts + email panel + slack-dm panel)
  • Ruff clean
  • Stub badge auto-disappears (`stub: false` always, even with empty list)

Cascade

Standalone PR. Ship checklist §4 satisfied.

After merge: cockpit complete (every panel reads real data).

🤖 Generated with Claude Code

…g sources

Flips ``GET /api/alerts/current`` from the v1 stub (PR #134) to a
real aggregator that pulls from 5 sources:

  * OperationalStateHolder (PAUSED / STOPPED)
  * Cost-ladder holder (WARN_75 / DOWNSHIFT_90 / HARD_STOP_100)
  * Audit JSONL (webhook.dead_letter / mcp.tool_called
    result=capability_denied / reasoning.tool_called
    tool_status=execution_error / slack_dm.reply_failed) — uses
    kora_cli.audit.jsonl_reader.read_audit_entries from #141
  * HealthRollupHolder (not directly emitted today; per-service
    snapshot data drives service_unhealthy below)
  * Heartbeat probe snapshots (Vercel / Sentry / Doppler /
    Supabase / Fly via current_service_snapshots() from #118)

10 rules emit operator-facing alerts:

  cost_ladder_warned       warning
  cost_ladder_downshifted  warning
  cost_ladder_halted       critical
  operator_paused          critical
  operator_stopped         critical
  webhook_dead_letters_24h warning  (threshold > 5)
  capability_denied_24h    info     (threshold > 10, forward-compat)
  reasoning_errors_24h     warning  (threshold > 5)
  slack_dm_reply_failed_24h warning (threshold > 3)
  service_unhealthy        warning/critical (per affected service)

Fail-soft per-source: each rule helper is wrapped in try/except;
a single source failure (holder uninitialized, JSONL unreadable,
probe accessor crash) NEVER bubbles a 500 to the operator. Outer
``compute_active_alerts`` also catches helper-bypass exceptions
as defense in depth.

Sort: severity rank (critical → warning → info) then by alert id
for stable intra-tier ordering.
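
A minimal sketch of the ordering described above, assuming alerts are plain dicts with `severity` and `id` keys (the real code sorts Alert dataclass instances, and `sort_alerts` is a hypothetical name):

```python
# Lower rank sorts first: critical, then warning, then info.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def sort_alerts(alerts):
    """Sort by severity rank, then by alert id for a stable order within a tier."""
    return sorted(alerts, key=lambda a: (SEVERITY_RANK[a["severity"]], a["id"]))
```

Using the id as the secondary key makes the output deterministic regardless of the order in which the sources were polled.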

K-DG drift caught + documented inline (vs bucket spec §1):
  * ``get_operational_state_holder()`` (spec) → actual symbol is
    ``get_holder()`` in agent.operational_state_holder
  * ``get_health_holder()`` (spec) → actual is
    ``get_health_rollup_holder()`` in agent.health_rollup_holder
  * ``current_pct_used()`` / ``active_rung()`` are METHODS on
    CostStateHolder (#126 PM-locked catch)
  * ``holder.current`` is a ``@property`` on OperationalStateHolder;
    ``.primary_state`` is bare enum field on the inner dataclass
    (#112 catch)
  * HealthRollup is a frozen dataclass with bare field names
    ``overall`` / ``control_plane`` / ``worker``, not ``@property``
    (#112 catch)
  * Audit reader at ``kora_cli/audit/jsonl_reader.py:58`` accepts
    ``seam=`` + ``since=`` kwargs (per #141)

Forward-compat note: ``capability_denied_24h`` rule's matcher
keys on ``details.result == "capability_denied"``. Today the
audit emit at ``mcp_tools.py:714`` runs AFTER the cap-gate at
``mcp.py:181``, so denial responses are NOT logged — rule emits
0 alerts in the current state (documented in aggregator + test).
When a follow-on bucket adds audit-on-denial the rule activates
automatically.

CC#2 #137 fixture-isolation discipline applied (3-namespace
``get_kora_home`` monkeypatch + reset all 5 aggregator sources
to baseline in the existing PR #134 stub-test fixture).

3-layer SECURITY contract preserved from PR #134: title/detail
plain text, no PII/token shapes in walk-payload sweep, TS
interface enforcement.

28 new aggregator tests (per-rule trigger + severity + fail-soft
+ sort + endpoint integration + walk-payload) + 15 preserved
shape/security tests in test_web_server_alerts.py. 123/123
cross-bucket regression (alerts + email panel + slack-dm panel).
Ruff clean.

After this lands every operator-facing panel reads real data.
Last major stub-flip in the cockpit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rafe-walker merged commit b959737 into feature/phase2-upgrades on May 23, 2026
rafe-walker deleted the feat/kora-KR-ALERTS-PANEL-FLIP branch on May 23, 2026 20:25
rafe-walker added a commit that referenced this pull request May 23, 2026
)

Activates the CC#1 forward-compat alert rule from #145. The alerts panel now surfaces real capability-denied alerts when callers are misconfigured.

725 insertions across 4 files:
1. _emit_capability_denied_audit at the cap-gate — result=capability_denied.
2. _emit_actor_id_required_audit at the stop-tool extra gate — result=actor_id_required (distinct discriminator so the alert rule does not conflate the two operator-fix paths).
3. Both helpers best-effort emit (sink failure caught + logged, never masks the denial response).
4. Reason text omitted from both denial audits (caller-supplied risk).
5. CC#1 pinned test renamed + flipped from no alert → alert fires when threshold exceeded, with 11 denials + 7 ok + 5 actor_id_required entries verifying the rule detail_match does not conflate.
6. Aggregator module + function docstrings updated — no longer forward-compat, now live.
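
Items 1 to 4 can be sketched roughly as below. The `sink` callable, the entry layout, and the `handle_tool_call` wrapper are assumptions for illustration; only the `mcp.tool_called` seam name and the two `result` discriminators come from the text above.

```python
import logging

logger = logging.getLogger("audit_sketch")  # hypothetical logger name


def emit_denial_audit(sink, tool, result):
    """Best-effort audit emit: a sink failure is logged, never raised.

    The caller-supplied reason text is deliberately left out of the entry.
    """
    entry = {"seam": "mcp.tool_called", "details": {"tool": tool, "result": result}}
    try:
        sink(entry)
    except Exception:
        logger.warning("audit emit failed for %s", tool, exc_info=True)


def handle_tool_call(sink, tool, has_capability, actor_id=None, needs_actor=False):
    """Hypothetical gate order: capability check first, then the actor-id gate."""
    if not has_capability:
        emit_denial_audit(sink, tool, "capability_denied")
        return {"ok": False, "error": "capability_denied"}
    if needs_actor and not actor_id:
        # Distinct discriminator so alert rules do not conflate the two paths.
        emit_denial_audit(sink, tool, "actor_id_required")
        return {"ok": False, "error": "actor_id_required"}
    return {"ok": True}
```

Because the emit sits inside its own try/except, the caller always receives the denial response even when the audit sink is down.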

35/35 new tests pass; 413/413 listeners+alerts+clients regression green.