This repository was archived by the owner on May 26, 2026. It is now read-only.
KR-SNAPSHOT-DAEMON-HEALTH — daemon's own listeners/uptime/errors (schema v4) #170
Merged
rafe-walker merged 1 commit on May 24, 2026
…/errors (schema v4)
* Bump snapshot schema_version 3 → 4.
* Add `snapshot.daemon_health` section — Kora's own runtime health,
distinct from `service_health` (external SaaS dependencies):
- `overall_status` — healthy / degraded / unhealthy / unknown,
derived from per-listener `down` count + recent error count.
- `boot_at` — ISO 8601 UTC wall-clock stamp at startup completion;
populated from new `DaemonCoordinator.get_boot_at()` accessor.
`"unknown"` while booting or when daemon isn't running.
- `uptime_seconds` — reuses existing `get_status()` monotonic
uptime field.
- `listeners` — per-listener `{status, last_event_at,
consecutive_errors}`. `status` derived from coordinator's
`started` flag; `last_event_at` + `consecutive_errors` are
`"unknown"` in v1 (no listener exposes those today; per-
listener event tracking is a follow-on bucket).
- `recent_error_count_5min` — tails `kora_audit_log.jsonl` for
failure-shaped seams (`webhook.dead_letter`,
`slack_dm.reply_failed`, `notification.dispatched` with
`status="failed"`). Reuses the existing
`read_audit_entries(since=...)` reader. Fail-soft: missing
file → 0.
* Coordinator gets a wall-clock `_startup_completed_wall_at`
stamped in lockstep with the existing monotonic
`_startup_completed_at`, exposed via `get_boot_at()`. No
existing accessor returned wall-clock boot time; monotonic
alone can't serialize to an ISO timestamp.
* No per-listener files modified — listener-level event tracking is
out of scope for v1 (spec §4 "describe + adapt"; 14 listeners
found vs 9 in spec — modifying 14 unrelated files for v1
would be churn). The shape is forward-compatible: future
buckets fill in `last_event_at` / `consecutive_errors` per
listener without breaking consumers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
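
The wall-clock stamp described in the commit message can be pictured with a minimal sketch. It assumes a simplified coordinator: `CoordinatorSketch`, `mark_startup_complete`, `uptime_seconds` and `boot_fields_for_snapshot` are hypothetical names for illustration, and the return type of `get_boot_at()` (ISO string or `None`) is an assumption, not the confirmed signature. Only `_startup_completed_at`, `_startup_completed_wall_at` and `get_boot_at()` come from the commit message.

```python
import time
from datetime import datetime, timezone
from typing import Optional


class CoordinatorSketch:
    """Hypothetical stand-in for DaemonCoordinator; only the boot stamps are modeled."""

    def __init__(self) -> None:
        self._startup_completed_at: Optional[float] = None            # monotonic seconds
        self._startup_completed_wall_at: Optional[datetime] = None    # UTC wall clock

    def mark_startup_complete(self) -> None:
        # Hypothetical hook name. Both stamps are taken together so they
        # describe the same instant: monotonic for uptime, wall clock for display.
        self._startup_completed_at = time.monotonic()
        self._startup_completed_wall_at = datetime.now(timezone.utc)

    def get_boot_at(self) -> Optional[str]:
        # Assumed return type: ISO 8601 UTC string, or None before startup completes.
        if self._startup_completed_wall_at is None:
            return None
        return self._startup_completed_wall_at.isoformat()

    def uptime_seconds(self) -> Optional[float]:
        # A monotonic difference is immune to wall-clock adjustments,
        # but it carries no calendar date and cannot become a timestamp.
        if self._startup_completed_at is None:
            return None
        return time.monotonic() - self._startup_completed_at


def boot_fields_for_snapshot(coordinator: Optional[CoordinatorSketch]) -> dict:
    """Map missing values to the "unknown" sentinel used by the snapshot."""
    boot_at = coordinator.get_boot_at() if coordinator is not None else None
    uptime = coordinator.uptime_seconds() if coordinator is not None else None
    return {
        "boot_at": boot_at if boot_at is not None else "unknown",
        "uptime_seconds": uptime if uptime is not None else "unknown",
    }
```

Keeping two stamps avoids deriving the boot time as "now minus uptime" on every read, which would drift whenever the system clock is adjusted.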
rafe-walker added a commit that referenced this pull request on May 24, 2026
…n holdouts (#174)

4-of-4 dashboard hero fields snapshot-driven on warm cache. Verified at DashboardPage.tsx:1270-1345: loadInitial does NOT call api.getOperationalState / getCurrentAlerts / getCostState / getHealthRollup with fresh snapshot.

PR #162 anti-projection tests FLIPPED (not deleted): test_dashboard_does_not_project_*_from_snapshot → test_dashboard_projects_*_from_snapshot with null-return + per-field fallback pins. New pins: badge surfaces N-of-M hero count + DASHBOARD_HERO_FIELD_COUNT===4 literal regression guard.

K-DG drift caught + fixed: SnapshotResponse TS type was stale, missing cost_ladder.spent_to_date_usd + credit_pool_usd (#169) + the entire daemon_health section (#170). Added; required before projection helpers could compile.

Per-card retry buttons + forceFullLiveRefresh still fan out by design (operator-triggered force paths).

19/19 tests pass, tsc + vite build clean.
This was referenced May 24, 2026
rafe-walker pushed a commit that referenced this pull request on May 24, 2026
…tion + revert

Build-list follow-on to PR #167 (phrasebook read-only viewer). Adds the operator-edit story so the phrasebook can be modified from the cockpit instead of YAML by hand. Also sets the UX foundation that the eventual promotion-review panel will reuse (KR-PROMOTE-PHRASEBOOK proposals are pending edits to this same surface).

Backend (kora_cli/short_circuit/phrasebook_editor.py, new)
==========================================================

Module hosts everything the read-only viewer didn't need:

* validate_entries(entries) → [EntryValidationError, ...]
  8 checks (in order; later checks are skipped for an entry that failed an earlier one, to avoid noise): payload is list, length cap (200 entries), required-fields-non-empty, length caps on each field, regex compiles with re.IGNORECASE, catastrophic-backtracking guard (catches (a+)+, (.*)*, etc), snapshot placeholder paths resolve to known scalars, no duplicate (pattern, category) tuples.
* SNAPSHOT_SCALAR_PATHS: static frozen-set of known snapshot v4 scalar paths the operator can reference in {snapshot.X.Y} placeholders. Drift-guarded by test_static_schema_matches_snapshot_collectors, which greps each leaf key against the snapshot collector source. Only scalars (no dynamic-key dicts like listeners.X or cost_telemetry.X; those str() to Python repr and would render garbage in operator-facing DMs).
* write_backup_for / rotate_backups / list_backups: timestamped backups under ${KORA_HOME}/phrasebook/backups/. KORA_PHRASEBOOK_BACKUP_COUNT env var tunes rotation (default 10; clamped to [1, 1000]). Filenames sort chronologically as plain strings (ISO-Z format).
* write_phrasebook(entries): atomic write via utils.atomic_replace (same pattern as snapshot writer). Deterministic field order for readable YAML diffs.
* revert_phrasebook(filename=...): revert to a specific backup OR (no filename) the most-recent OR (no backups) remove the override entirely. Path-traversal defense rejects any filename containing /, \, .., or not matching the slack_dm.*.yml shape.

Backend (kora_cli/web_server.py, 3 new endpoints)
=================================================

PUT  /api/phrasebook/slack_dm          validate + back up + atomic-write + audit
POST /api/phrasebook/slack_dm/revert   revert to backup
GET  /api/phrasebook/slack_dm/backups  list backups

PUT contract:

* Validation fails → 422 with structured per-entry errors; NO write; previous override preserved.
* Validation passes → backup current (if exists) → atomic write → rotate backups → audit row → 200 with echoed entries + backup_filename + rotated_count.
* Write itself fails → 500; backup preserved so operator can recover.

Audit seam (new): phrasebook.updated
====================================

Extended SeamName Literal in kora_cli/audit/jsonl_sink.py. Each successful PUT (or revert) emits one entry with:

* actor: "operator" (cockpit-driven; future "kora_proposal_approved" from the promotion-loop bucket reuses this shape)
* action: "put" | "revert"
* entry_count_before / entry_count_after
* backup_filename (when applicable)
* rotated_backup_count (when applicable)
* reverted_to (when action=revert)

Drives the future KR-PROMOTION-REVIEW-PANEL via the existing audit-panel infrastructure (KR-AUDIT-PANEL-ENDPOINTS PR #155).

Frontend
========

* web/src/pages/PhrasebookEditor.tsx (new): hosts the editor sub-components so PhrasebookPage stays readable: EntryEditorRow (4 inline editable fields + per-field validation errors), EditModeControls (Save / Cancel / Add), BackupsDialog (modal with newest-first list + per-row revert + confirm), ClientSidePreview (mirrors dm_phrasebook.match + render_reply in TS so operator can preview in-progress edits without saving).
* api wrappers + types: putSlackDmPhrasebook / revertSlackDmPhrasebook / getSlackDmPhrasebookBackups + PhrasebookEntryWrite / PhrasebookPutResponse / PhrasebookValidationErrorEntry/Body / PhrasebookRevertResponse / PhrasebookBackupItem / PhrasebookBackupsResponse.
* PhrasebookPage extended with edit-mode toggle. View-mode surface (read-only table + live tester) is unchanged for operators who just want to inspect. Edit-mode swaps to editor rows + client-side preview + Save/Cancel/Add. Backups button in view-mode opens the revert dialog.
* 422 validation-error parsing: Save handler unmarshals fetchJSON's "STATUS: BODY" Error message; on 422 with error="validation_failed" the body's per-entry errors are routed to the editor for inline rendering.

Tests
=====

Backend (tests/kora_cli/test_phrasebook_editor.py, 44 tests):

* Validation (14): valid entries, non-list, missing field, empty whitespace, all 4 length caps, entries count cap, invalid regex, catastrophic-backtracking parametrized across 4 pathological patterns, unknown vs known snapshot path, duplicate (pattern, category), schema drift guard against state_snapshot.py.
* Backups (5): no-override returns None, copies content + timestamps filename, rotation keeps N most recent, env override + clamping parametrized, list returns newest-first with entry_count.
* Write+revert (5): round-trip load, revert by name, revert most-recent, revert with no backups removes override, path-traversal rejection parametrized across 5 attacks.
* Endpoints (7): PUT valid + audit, PUT invalid + preserve, PUT no-existing-override, POST revert + audit, POST revert invalid filename → 400, POST revert missing → 404, GET backups newest-first.
* Audit (1): SeamName Literal includes phrasebook.updated.
* Integration (1): round-trip PUT then revert restores seed.

FE source-pins (tests/kora_cli/test_phrasebook_editor_fe_pins.py, 17 tests): api wrappers (3), TS types declared + shape pinned (2), editor file exists + exports (1), page wiring (3), edit button hidden in edit-mode (1), 422 body parsing branch present (1), ClientSidePreview mirrors backend regex flag + "unknown" sentinel + null-snapshot fall-through (3), SnapshotResponse type covers daemon_health (1).

All 76 phrasebook tests pass (44 BE editor + 17 FE pins + 15 existing PR #167 read-side tests). tsc -b clean. vite build clean.

Screenshots
===========

* web/docs/phrasebook-editor/edit_mode.png: edit-mode UI with client-side preview rendering a $0 reply
* web/docs/phrasebook-editor/validation.png: 422 response with 4 inline per-field errors (bad regex, unknown snapshot path, catastrophic backtracking, missing description) + top-level duplicate error
* web/docs/phrasebook-editor/revert.png: backups modal with newest-first list, per-row Revert confirm flow, greyed-out corrupt backup

Design choices noted (no STOP-ASKs triggered)
=============================================

* Snapshot field-path validation uses a STATIC allow-list (SNAPSHOT_SCALAR_PATHS), NOT a live snapshot walk. Spec §4 flagged this as a possible STOP-ASK; the static-list approach is what the spec offered as the alternative ("use SnapshotResponse static schema") and avoids the dynamic-snapshot-during-warm-up problem where freshly-booted daemons would mark canonical paths as invalid. Drift guard test pins each scalar leaf against the snapshot collector source.
* Audit seam SeamName extension follows the probe.wake_requested precedent: extend Literal with a new value + add the comment explaining the future actor extension. No consumer drift.
* Operator-defined category values: free-form for v1 (the runtime doesn't reserve any category names yet; promotion-loop bucket can add reserved-prefix discipline when its UX lands).

Refs
====

* rafe-walker/kora-docs 17_cc_bucket_prompts/KR-FE-PHRASEBOOK-EDITOR-AND-CRUD_write_path_with_validation.md
* PR #160: phrasebook + dm_phrasebook module (read side)
* PR #167: KR-FE-PHRASEBOOK-VIEWER (read-only viewer this extends)
* PR #170: snapshot v4 daemon_health (referenced in SNAPSHOT_SCALAR_PATHS allow-list)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
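
The two guards mentioned in the referenced commit (the path-traversal defense on revert and the clamped backup count) can be sketched as below. This is an illustration under stated assumptions, not the code from phrasebook_editor.py: the function names, the regex chosen for the slack_dm.*.yml shape, and the fallback to the default on a non-integer env value are all hypothetical. Only the rejected characters, the env var name, the default of 10 and the [1, 1000] clamp come from the commit message.

```python
import os
import re
from typing import Mapping, Optional

# Hypothetical regex for the "slack_dm.*.yml" filename shape.
_BACKUP_NAME = re.compile(r"slack_dm\..+\.yml")

DEFAULT_BACKUP_COUNT = 10
MIN_BACKUP_COUNT, MAX_BACKUP_COUNT = 1, 1000


def is_safe_backup_filename(name: str) -> bool:
    """Reject separators and parent-directory hops before checking the shape."""
    if "/" in name or "\\" in name or ".." in name:
        return False
    return _BACKUP_NAME.fullmatch(name) is not None


def backup_count(env: Optional[Mapping[str, str]] = None) -> int:
    """Read KORA_PHRASEBOOK_BACKUP_COUNT and clamp it to [1, 1000]."""
    env = os.environ if env is None else env
    raw = env.get("KORA_PHRASEBOOK_BACKUP_COUNT")
    try:
        value = int(raw) if raw is not None else DEFAULT_BACKUP_COUNT
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        value = DEFAULT_BACKUP_COUNT
    return max(MIN_BACKUP_COUNT, min(MAX_BACKUP_COUNT, value))
```

Checking the forbidden substrings before the shape match keeps the rejection reason unambiguous: a name such as `../slack_dm.x.yml` fails on the separator, never on the pattern.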
Summary
Adds `snapshot.daemon_health`: Kora's OWN runtime health, distinct from `snapshot.service_health` (external SaaS deps). Symmetric finish to #169, which closed the cost-fields side; sets up CC#2 to shift HealthHero off fan-out onto snapshot reads.

Schema bumped 3 → 4. Single new section, no existing field shapes touched.

Bucket spec: `17_cc_bucket_prompts/KR-SNAPSHOT-DAEMON-HEALTH_listeners_uptime_errors.md`.

Listener inventory (K-DG grep finding)

14 listeners found (spec assumed 9). Per spec §4 STOP-ASK rule: divergence → describe + adapt, no PM ask. All discovered via `register_daemon_listener(name, factory)`:

* `heartbeat_probes` (`kora_cli/listeners/heartbeat_probes_listener.py:116`)
* `snapshot` (`kora_cli/listeners/snapshot_listener.py:117`)
* `purelymail_client` (`kora_cli/listeners/purelymail_client_listener.py:138`)
* `probe_wake` (`kora_cli/listeners/probe_wake_listener.py:297`)
* `mcp` (`kora_cli/listeners/mcp.py:615`)
* `email_inbound_imap` (`kora_cli/listeners/email_inbound_imap_listener.py:292`)
* `heartbeat` (`kora_cli/listeners/heartbeat.py:193`)
* `web` (`kora_cli/listeners/web.py:125`)
* `cost_telemetry` (`kora_cli/listeners/cost_telemetry_listener.py:331`)
* `slack_client` (`kora_cli/listeners/slack_client_listener.py:142`)
* `mcp_consumption` (`kora_cli/listeners/mcp_consumption.py:304`)
* `webhooks` (`kora_cli/listeners/webhooks.py:352`)
* `reasoning_engine` (`kora_cli/listeners/reasoning_engine_listener.py:152`)
* `alert_notifier` (`kora_cli/listeners/alert_notifier_listener.py:298`)

Accessors: pre-existing vs added

* `current_coordinator()` process-global: pre-existing, `kora_cli/daemon.py:398` (KR-D-DAEMON ST2)
* `DaemonCoordinator.get_status()` (state / uptime / `listeners[].started`): pre-existing, `kora_cli/daemon.py:287` (KR-D-DAEMON ST2)
* `_startup_completed_at` (monotonic): pre-existing
* `boot_at` accessor: added as `_startup_completed_wall_at` + `get_boot_at()`
* `read_audit_entries(since=...)`: pre-existing (KR-AUDIT-PANEL-ENDPOINTS)
* `last_event_at` / `consecutive_errors`: not exposed by any listener today; not added in v1

Per-listener event tracking would touch 14 unrelated listener files for v1 with no consumer wired yet. Spec §4 explicitly allows "describe + adapt"; v1 surfaces what the coordinator already knows (`started: bool` → `up`/`down`) and stubs the other fields as `"unknown"` so the wire shape is forward-compatible.

Snapshot section shape (v4)
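
A minimal sketch of the section, written as Python `TypedDict`s and assuming only the field names and value rules given in the commit message. The class names and every example value are hypothetical; the two listener keys are taken from the inventory above purely for illustration.

```python
from typing import Dict, TypedDict, Union


class ListenerHealth(TypedDict):
    status: str                          # "up" | "down" | "unknown"
    last_event_at: str                   # always "unknown" in v1
    consecutive_errors: Union[int, str]  # always "unknown" in v1


class DaemonHealth(TypedDict):
    overall_status: str                  # healthy / degraded / unhealthy / unknown
    boot_at: str                         # ISO 8601 UTC, or "unknown"
    uptime_seconds: Union[float, str]    # seconds, or "unknown" while booting
    listeners: Dict[str, ListenerHealth] # keyed by listener name
    recent_error_count_5min: int


# Hypothetical example values, for illustration only.
example: DaemonHealth = {
    "overall_status": "healthy",
    "boot_at": "2026-05-24T12:00:00+00:00",
    "uptime_seconds": 3600.0,
    "listeners": {
        "heartbeat": {"status": "up", "last_event_at": "unknown", "consecutive_errors": "unknown"},
        "web": {"status": "up", "last_event_at": "unknown", "consecutive_errors": "unknown"},
    },
    "recent_error_count_5min": 0,
}
```

Because the stubbed fields already exist with the `"unknown"` sentinel, a later version can swap in real values without adding or renaming keys.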
Derivation rules:

* `overall_status`: 3+ listeners down OR ≥20 errors → unhealthy. 1-2 down OR 5-19 errors → degraded. All up + <5 errors → healthy. All listeners `unknown` (no coordinator) → unknown.
* `boot_at` + `uptime_seconds`: `"unknown"` while booting (startup not complete) or daemon not running.
* `recent_error_count_5min`: counts audit entries with seam in `{webhook.dead_letter, slack_dm.reply_failed}` OR `notification.dispatched` with `details.status == "failed"`. Window = 300s. Missing audit file → 0.

CC#2 follow-on recommendation: KR-FE-DASHBOARD-HEALTH-SNAPSHOT-SHIFT
PR #162's `HealthHero` is the next candidate to shift to snapshot reads:

* `HealthHero` currently fans out to multiple endpoints to reconstruct daemon health. With v4, a single `daemon_snapshot.json` read returns `overall_status` + per-listener + uptime + boot_at + recent errors.
* … `/api/daemon/status` directly).
* … `"unknown"` per-listener `last_event_at` / `consecutive_errors`, CC#2 can render listener up/down badges + an aggregate health pill from `overall_status`. The two `"unknown"` fields can be hidden behind a `=== "unknown"` check until v2 wires them.
* … `ListenerHealthRegistry` that listeners write to), the snapshot's per-listener dict will populate the same keys without breaking consumers: strictly additive at the value level (`"unknown"` → real value).

Test plan
* `tests/kora_cli/snapshot/test_daemon_health.py`: 22 new tests (schema bump, overall_status thresholds, coordinator wired / unavailable / raises, boot_at populated / unknown, audit-tail counts: failure seams, notification status, time-window exclusion, missing file)
* `tests/kora_cli/snapshot/test_state_snapshot.py`: top-level keys + schema_version updated; 47 existing tests pass
* `ruff check` clean on changed files

🤖 Generated with Claude Code
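
The derivation rules for `overall_status` and `recent_error_count_5min` stated in the PR description can be sketched as two small functions. This is a hedged illustration, not the collector's code: the function names are hypothetical, the audit entry is assumed to be a dict with `seam`, optional `details`, and a timezone-aware `timestamp` (the `timestamp` key is an assumption), and treating an empty listener list as `unknown` is also an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Mapping, Sequence

FAILURE_SEAMS = {"webhook.dead_letter", "slack_dm.reply_failed"}
WINDOW_SECONDS = 300


def derive_overall_status(listener_statuses: Sequence[str], recent_errors: int) -> str:
    """Apply the thresholds from the derivation rules, most severe first."""
    # Assumption: no listeners at all is treated like "all unknown".
    if not listener_statuses or all(s == "unknown" for s in listener_statuses):
        return "unknown"
    down = sum(1 for s in listener_statuses if s == "down")
    if down >= 3 or recent_errors >= 20:
        return "unhealthy"
    if down >= 1 or recent_errors >= 5:
        return "degraded"
    return "healthy"


def is_failure_entry(entry: Mapping) -> bool:
    """True for the failure-shaped seams named in the PR description."""
    seam = entry.get("seam")
    if seam in FAILURE_SEAMS:
        return True
    details = entry.get("details") or {}
    return seam == "notification.dispatched" and details.get("status") == "failed"


def count_recent_errors(entries: Iterable[Mapping], now: datetime) -> int:
    """Count failure entries inside the 300 s window ending at `now`."""
    cutoff = now - timedelta(seconds=WINDOW_SECONDS)
    return sum(1 for e in entries if is_failure_entry(e) and e["timestamp"] >= cutoff)
```

A missing audit file corresponds to an empty `entries` iterable here, which yields 0 and matches the fail-soft rule.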