This repository was archived by the owner on May 26, 2026. It is now read-only.
feat(kora): KR-ALERTS-PANEL-FLIP — aggregate real alerts from existing sources #145
Merged
rafe-walker merged 1 commit on May 23, 2026
…g sources

Flips ``GET /api/alerts/current`` from the v1 stub (PR #134) to a real aggregator that pulls from 5 sources:

* OperationalStateHolder (PAUSED / STOPPED)
* Cost-ladder holder (WARN_75 / DOWNSHIFT_90 / HARD_STOP_100)
* Audit JSONL (webhook.dead_letter / mcp.tool_called result=capability_denied / reasoning.tool_called tool_status=execution_error / slack_dm.reply_failed); uses kora_cli.audit.jsonl_reader.read_audit_entries from #141
* HealthRollupHolder (not directly emitted today; per-service snapshot data drives service_unhealthy below)
* Heartbeat probe snapshots (Vercel / Sentry / Doppler / Supabase / Fly via current_service_snapshots() from #118)

10 rules emit operator-facing alerts:

    cost_ladder_warned           warning
    cost_ladder_downshifted      warning
    cost_ladder_halted           critical
    operator_paused              critical
    operator_stopped             critical
    webhook_dead_letters_24h     warning (threshold > 5)
    capability_denied_24h        info (threshold > 10, forward-compat)
    reasoning_errors_24h         warning (threshold > 5)
    slack_dm_reply_failed_24h    warning (threshold > 3)
    service_unhealthy            warning/critical (per affected service)

Fail-soft per-source: each rule helper is wrapped in try/except; a single source failure (holder uninitialized, JSONL unreadable, probe accessor crash) NEVER bubbles a 500 to the operator. Outer ``compute_active_alerts`` also catches helper-bypass exceptions as defense in depth.

Sort: severity rank (critical → warning → info), then by alert id for stable intra-tier ordering.
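The sort rule above can be sketched as follows. This is a minimal illustration, not the aggregator's code: the `Alert` dataclass, `SEVERITY_RANK`, and `sort_alerts` are hypothetical names, and the real alert shape is whatever the Alert TS interface defines.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert shape, used only for this sketch."""
    id: str
    severity: str  # "critical" | "warning" | "info"


# Lower rank sorts first: critical, then warning, then info.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def sort_alerts(alerts: Iterable[Alert]) -> List[Alert]:
    """Order by severity rank, then by id so ties within a tier are stable."""
    unknown_rank = len(SEVERITY_RANK)  # unrecognized severities sort last
    return sorted(
        alerts,
        key=lambda a: (SEVERITY_RANK.get(a.severity, unknown_rank), a.id),
    )
```

Sorting on the `(rank, id)` tuple keeps the output deterministic regardless of the order in which the sources were polled.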
K-DG drift caught + documented inline (vs bucket spec §1):

* ``get_operational_state_holder()`` (spec) → actual symbol is ``get_holder()`` in agent.operational_state_holder
* ``get_health_holder()`` (spec) → actual is ``get_health_rollup_holder()`` in agent.health_rollup_holder
* ``current_pct_used()`` / ``active_rung()`` are METHODS on CostStateHolder (#126 PM-locked catch)
* ``holder.current`` is a ``@property`` on OperationalStateHolder; ``.primary_state`` is a bare enum field on the inner dataclass (#112 catch)
* HealthRollup is a frozen dataclass with bare field names ``overall`` / ``control_plane`` / ``worker``, not ``@property`` (#112 catch)
* Audit reader at ``kora_cli/audit/jsonl_reader.py:58`` accepts ``seam=`` + ``since=`` kwargs (per #141)

Forward-compat note: the ``capability_denied_24h`` rule's matcher keys on ``details.result == "capability_denied"``. Today the audit emit at ``mcp_tools.py:714`` runs AFTER the cap-gate at ``mcp.py:181``, so denial responses are NOT logged and the rule emits 0 alerts in the current state (documented in aggregator + test). When a follow-on bucket adds audit-on-denial, the rule activates automatically.

CC#2 #137 fixture-isolation discipline applied (3-namespace ``get_kora_home`` monkeypatch + reset all 5 aggregator sources to baseline in the existing PR #134 stub-test fixture).

3-layer SECURITY contract preserved from PR #134: title/detail plain text, no PII/token shapes in walk-payload sweep, TS interface enforcement.

28 new aggregator tests (per-rule trigger + severity + fail-soft + sort + endpoint integration + walk-payload) + 15 preserved shape/security tests in test_web_server_alerts.py. 123/123 cross-bucket regression (alerts + email panel + slack-dm panel). Ruff clean.

After this lands every operator-facing panel reads real data. Last major stub-flip in the cockpit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 23, 2026
rafe-walker added a commit that referenced this pull request on May 23, 2026
) Activates CC#1 forward-compat alert rule from #145. Alerts panel surfaces real capability-denied warnings when callers are misconfigured.

725 insertions across 4 files:

1. _emit_capability_denied_audit at the cap-gate: result=capability_denied.
2. _emit_actor_id_required_audit at the stop-tool extra gate: result=actor_id_required (distinct discriminator so the alert rule does not conflate the two operator-fix paths).
3. Both helpers emit best-effort (sink failure caught + logged, never masks the denial response).
4. Reason text omitted from both denial audits (caller-supplied risk).
5. CC#1 pinned test renamed + flipped from "no alert" to "alert fires when threshold exceeded", with 11 denials + 7 ok + 5 actor_id_required entries verifying that the rule's detail_match does not conflate them.
6. Aggregator module + function docstrings updated: no longer forward-compat, now live.

35/35 new tests pass; 413/413 listeners+alerts+clients regression green.
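A rough sketch of the best-effort emit described in items 1 to 4, under stated assumptions: `emit_denial_audit`, `check_capability`, the `sink` callable, and the entry shape are all hypothetical stand-ins, not the helpers added in the commit.

```python
import logging
from typing import Callable, Dict, Set

logger = logging.getLogger(__name__)


def emit_denial_audit(sink: Callable[[dict], None], tool: str, result: str) -> None:
    """Best-effort audit emit: a sink failure is logged, never raised."""
    # ASSUMPTION: illustrative entry shape. Caller-supplied reason text is
    # deliberately left out, mirroring item 4.
    entry = {"event": "mcp.tool_called", "details": {"tool": tool, "result": result}}
    try:
        sink(entry)
    except Exception:
        logger.exception("audit sink failed while recording denial for %s", tool)


def check_capability(tool: str, granted: Set[str], sink: Callable[[dict], None]) -> Dict[str, object]:
    """Hypothetical cap-gate: audit the denial, then return it unchanged."""
    if tool not in granted:
        emit_denial_audit(sink, tool, "capability_denied")
        return {"ok": False, "error": "capability_denied"}
    return {"ok": True}
```

Because the emit swallows sink errors, the caller receives the same denial response whether or not the audit write succeeded.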
Summary
Flips `GET /api/alerts/current` from the v1 stub (#134) to a real aggregator pulling from 5 data sources. Backend-only — Alert TS interface unchanged in `web/src/lib/api.ts`.
Bucket spec: `17_cc_bucket_prompts/KR-ALERTS-PANEL-FLIP_real_aggregation.md`
After this lands, every operator-facing panel reads real data. Last major stub-flip in the cockpit.
Source PRs cited
`CostRung` enum + `active_rung()` method
§1 K-DG drift caught + corrected
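The bucket spec's §1 listed two symbols that don't exist verbatim; verified via grep at HEAD `054f4086`:
* `get_operational_state_holder()` (spec) → actual symbol is `get_holder()` in `agent.operational_state_holder`
* `get_health_holder()` (spec) → actual symbol is `get_health_rollup_holder()` in `agent.health_rollup_holder`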
Documented in the aggregator module docstring under the `K-DG` heading.
Rule taxonomy
Thresholds are PROPOSED per bucket §2(b) and defined as module-level constants for easy tuning. `service_unhealthy` emits one alert PER affected service (per spec).
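As an illustration of module-level thresholds and the one-alert-per-service rule, here is a minimal sketch. The constant names, the snapshot shape (service name mapped to a status string), the status values, the status-to-severity mapping, and the `service_unhealthy:<name>` id format are all assumptions made for the example; only the threshold values and the rule name come from this PR.

```python
from typing import Dict, List

# Threshold values from the rule list; a rule fires when its count is
# strictly greater than the threshold. Constant names are hypothetical.
WEBHOOK_DEAD_LETTERS_THRESHOLD = 5
CAPABILITY_DENIED_THRESHOLD = 10
REASONING_ERRORS_THRESHOLD = 5
SLACK_DM_REPLY_FAILED_THRESHOLD = 3

# ASSUMPTION: illustrative statuses and severities, not the probe's real ones.
STATUS_TO_SEVERITY = {"degraded": "warning", "down": "critical"}


def service_unhealthy_alerts(snapshots: Dict[str, str]) -> List[dict]:
    """Emit one alert per affected service; healthy services emit nothing."""
    alerts = []
    for service, status in sorted(snapshots.items()):
        severity = STATUS_TO_SEVERITY.get(status)
        if severity is None:
            continue  # healthy or unrecognized status: no alert
        alerts.append({
            "id": f"service_unhealthy:{service}",
            "severity": severity,
            "title": f"{service} is {status}",
        })
    return alerts
```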
Forward-compat: `capability_denied_24h`
The audit emit at `mcp_tools.py:714` is reached AFTER the cap-gate at `mcp.py:181`, so denial responses are NOT currently audit-logged. Rule's matcher uses `details.result == "capability_denied"` so when a follow-on bucket adds audit-on-denial the rule activates automatically; in the meantime it emits zero alerts. Test `test_capability_denied_today_no_alert_since_audit_doesnt_emit_denials` pins this expected behavior.
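A minimal sketch of how such a matcher might count denials over a 24-hour window. The entry shape (`event`, `timestamp`, `details` keys) and the function name are assumptions for illustration; the real rule reads entries through `read_audit_entries`, whose return shape is not shown in this PR.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

CAPABILITY_DENIED_THRESHOLD = 10  # rule fires only when count > 10


def capability_denied_alerts(entries: Iterable[dict], now: datetime) -> List[dict]:
    """Return a single info alert when denials in the last 24h exceed the threshold."""
    cutoff = now - timedelta(hours=24)
    count = sum(
        1
        for entry in entries
        # ASSUMPTION: hypothetical entry keys, used only for this sketch.
        if entry.get("event") == "mcp.tool_called"
        and entry["timestamp"] >= cutoff
        and entry.get("details", {}).get("result") == "capability_denied"
    )
    if count <= CAPABILITY_DENIED_THRESHOLD:
        return []
    return [{
        "id": "capability_denied_24h",
        "severity": "info",
        "title": f"{count} capability denials in the last 24h",
    }]
```

With no denial entries in the audit log, the count is zero and the function returns an empty list, which matches the pinned zero-alert behavior.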
Fail-soft contract
Each per-source helper wraps its access in try/except + returns `[]` on any failure. Outer `compute_active_alerts` catches helper-bypass exceptions too (defense in depth). Operator NEVER sees 500 from the endpoint.
Tests prove this with 4 fail-soft scenarios: cost holder raises, operational holder None, audit reader raises, probe snapshots accessor raises — each test confirms OTHER rules still emit + the aggregator never propagates.
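The fail-soft contract can be sketched roughly as below. This is an illustration under assumptions: `fail_soft` and `aggregate_alerts` are hypothetical names, and the real `compute_active_alerts` signature is not shown in this PR.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)

AlertSource = Callable[[], List[dict]]


def fail_soft(helper: AlertSource) -> List[dict]:
    """Run one per-source helper; any failure is logged and yields no alerts."""
    try:
        return list(helper())
    except Exception:
        logger.exception("alert source %r failed", getattr(helper, "__name__", helper))
        return []


def aggregate_alerts(helpers: Iterable[AlertSource]) -> List[dict]:
    """Hypothetical outer aggregator: one failing source never blocks the others."""
    alerts: List[dict] = []
    for helper in helpers:
        alerts.extend(fail_soft(helper))
    return alerts
```

Catching at the per-helper boundary is what lets the remaining rules keep emitting when, for example, the cost holder raises.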
Surface
3-layer SECURITY contract preserved (from PR #134)
CC#2 #137 fixture-isolation applied
Both test files use the 3-namespace `get_kora_home` monkeypatch (kora_constants + kora_cli.config + kora_cli.web_server) + reset all 5 aggregator sources to a baseline no-alert state. Prevents pytest-xdist parallel workers from seeing each other's source-state.
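Why three namespaces need patching: a `from ... import get_kora_home` binds a separate name in each importing module, so patching only the defining module leaves the other two pointing at the real function. The sketch below shows the idea with stand-in namespaces and the standard library's `unittest.mock` rather than the pytest `monkeypatch` fixture the tests use; the namespace objects and `isolated_kora_home` are hypothetical.

```python
import types
from contextlib import ExitStack, contextmanager
from pathlib import Path
from unittest import mock


def _real_get_kora_home() -> Path:
    return Path("/real/kora/home")


# Stand-ins for kora_constants, kora_cli.config and kora_cli.web_server,
# each holding its own binding of the same function.
kora_constants = types.SimpleNamespace(get_kora_home=_real_get_kora_home)
kora_cli_config = types.SimpleNamespace(get_kora_home=_real_get_kora_home)
kora_cli_web_server = types.SimpleNamespace(get_kora_home=_real_get_kora_home)
NAMESPACES = (kora_constants, kora_cli_config, kora_cli_web_server)


@contextmanager
def isolated_kora_home(tmp_home: Path):
    """Point every namespace's get_kora_home at tmp_home, restoring on exit."""
    with ExitStack() as stack:
        for namespace in NAMESPACES:
            stack.enter_context(
                mock.patch.object(namespace, "get_kora_home", lambda: tmp_home)
            )
        yield tmp_home
```

Giving each test its own temporary home directory is what keeps parallel workers from reading one another's state.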
Test plan
Cascade
Standalone PR. Ship checklist §4 satisfied.
After merge: cockpit complete (every panel reads real data).
🤖 Generated with Claude Code