This repository was archived by the owner on May 26, 2026. It is now read-only.

KR-SNAPSHOT-DAEMON-HEALTH — daemon's own listeners/uptime/errors (schema v4)#170

Merged: rafe-walker merged 1 commit into feature/phase2-upgrades from feat/kora-KR-SNAPSHOT-DAEMON-HEALTH on May 24, 2026

@rafe-walker

Owner

Summary

Adds snapshot.daemon_health: Kora's own runtime health, distinct from snapshot.service_health (external SaaS deps). This is the symmetric finish to #169, which closed the cost-fields side, and it sets up CC#2 to shift HealthHero off fan-out onto snapshot reads.

Schema bumped 3 → 4. Single new section, no existing field shapes touched.

Bucket spec: 17_cc_bucket_prompts/KR-SNAPSHOT-DAEMON-HEALTH_listeners_uptime_errors.md.

Listener inventory (K-DG grep finding)

14 listeners found (spec assumed 9). Per spec §4 STOP-ASK rule: divergence → describe + adapt, no PM ask. All discovered via register_daemon_listener(name, factory):

  1. heartbeat_probes (kora_cli/listeners/heartbeat_probes_listener.py:116)
  2. snapshot (kora_cli/listeners/snapshot_listener.py:117)
  3. purelymail_client (kora_cli/listeners/purelymail_client_listener.py:138)
  4. probe_wake (kora_cli/listeners/probe_wake_listener.py:297)
  5. mcp (kora_cli/listeners/mcp.py:615)
  6. email_inbound_imap (kora_cli/listeners/email_inbound_imap_listener.py:292)
  7. heartbeat (kora_cli/listeners/heartbeat.py:193)
  8. web (kora_cli/listeners/web.py:125)
  9. cost_telemetry (kora_cli/listeners/cost_telemetry_listener.py:331)
  10. slack_client (kora_cli/listeners/slack_client_listener.py:142)
  11. mcp_consumption (kora_cli/listeners/mcp_consumption.py:304)
  12. webhooks (kora_cli/listeners/webhooks.py:352)
  13. reasoning_engine (kora_cli/listeners/reasoning_engine_listener.py:152)
  14. alert_notifier (kora_cli/listeners/alert_notifier_listener.py:298)

Accessors: pre-existing vs added

| Concern | Pre-existing | Added in this PR |
|---|---|---|
| current_coordinator() process-global | kora_cli/daemon.py:398 (KR-D-DAEMON ST2) | |
| DaemonCoordinator.get_status() (state / uptime / listeners[].started) | kora_cli/daemon.py:287 (KR-D-DAEMON ST2) | |
| Monotonic _startup_completed_at | ✅ for uptime calc | |
| Wall-clock boot_at accessor | ❌ monotonic can't ISO-serialize | _startup_completed_wall_at + get_boot_at() |
| Audit JSONL reader with time filter | read_audit_entries(since=...) (KR-AUDIT-PANEL-ENDPOINTS) | |
| Per-listener last_event_at / consecutive_errors | ❌ no listener exposes these | deferred to v2; see follow-on note below |

Per-listener event tracking would touch 14 unrelated listener files for v1 with no consumer wired yet. Spec §4 explicitly allows "describe + adapt"; v1 surfaces what the coordinator already knows (started: bool → up / down) and stubs the other fields as "unknown" so the wire shape is forward-compatible.
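
A minimal sketch of that v1 mapping, for illustration only. It assumes get_status() returns a dict whose "listeners" value is a list of items carrying a "name" key alongside the "started" flag; the "name" key and the listener_entries helper are hypothetical, not taken from the PR.

```python
from typing import Any, Dict, Optional

UNKNOWN = "unknown"  # v1 sentinel for fields no listener exposes yet


def listener_entries(status: Optional[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map a coordinator status dict to the per-listener wire shape.

    Assumed input shape (hypothetical):
    {"listeners": [{"name": "snapshot", "started": True}, ...]}
    """
    if not status:
        return {}  # no coordinator available: nothing to report
    entries: Dict[str, Dict[str, Any]] = {}
    for item in status.get("listeners", []):
        entries[item["name"]] = {
            # The coordinator only knows whether the listener was started.
            "status": "up" if item.get("started") else "down",
            # Stubbed so consumers can rely on the keys being present.
            "last_event_at": UNKNOWN,
            "consecutive_errors": UNKNOWN,
        }
    return entries
```

Because the keys are always present, a later version could swap the "unknown" values for real ones without changing the shape consumers parse.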

Snapshot section shape (v4)

"daemon_health": {
  "overall_status": "healthy",
  "boot_at": "2026-05-23T10:00:00Z",
  "uptime_seconds": 14400.0,
  "listeners": {
    "snapshot": {"status": "up", "last_event_at": "unknown", "consecutive_errors": "unknown"},
    ...14 listeners total
  },
  "recent_error_count_5min": 0
}

Derivation rules:

  • overall_status: 3+ listeners down OR ≥20 errors → unhealthy. 1-2 down OR 5-19 errors → degraded. All up + <5 errors → healthy. All listeners unknown (no coordinator) → unknown.
  • boot_at + uptime_seconds: "unknown" while booting (startup not complete) or daemon not running.
  • recent_error_count_5min: counts audit entries with seam in {webhook.dead_letter, slack_dm.reply_failed} OR notification.dispatched w/ details.status == "failed". Window = 300s. Missing audit file → 0.
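
The derivation rules above can be sketched roughly as follows. This is an illustration, not the merged implementation: the function names are hypothetical, the entries are assumed to be already limited to the 300s window (for example by read_audit_entries(since=...)), and an empty listener set is assumed to count as "unknown". The rules do not state whether the "unknown" check takes precedence over the error thresholds; this sketch assumes it does.

```python
from typing import Any, Iterable, Mapping, Sequence

FAILURE_SEAMS = frozenset({"webhook.dead_letter", "slack_dm.reply_failed"})


def count_failures(entries: Iterable[Mapping[str, Any]]) -> int:
    """Count failure-shaped audit entries (already filtered to the time window)."""
    total = 0
    for entry in entries:
        seam = entry.get("seam")
        if seam in FAILURE_SEAMS:
            total += 1
        elif seam == "notification.dispatched":
            details = entry.get("details") or {}
            if details.get("status") == "failed":
                total += 1
    return total


def derive_overall_status(listener_statuses: Sequence[str], error_count: int) -> str:
    """Apply the healthy / degraded / unhealthy / unknown thresholds."""
    # Assumption: no listeners at all is treated like "all unknown".
    if not listener_statuses or all(s == "unknown" for s in listener_statuses):
        return "unknown"
    down = sum(1 for s in listener_statuses if s == "down")
    if down >= 3 or error_count >= 20:
        return "unhealthy"
    if down >= 1 or error_count >= 5:
        return "degraded"
    return "healthy"
```

For example, 13 listeners up with one down and zero recent errors falls in the degraded band, while 14 up with 20 errors is unhealthy.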

CC#2 follow-on recommendation: KR-FE-DASHBOARD-HEALTH-SNAPSHOT-SHIFT

PR #162's HealthHero is the next candidate to shift to snapshot reads:

  1. HealthHero currently fans out to multiple endpoints to reconstruct daemon health. With v4, a single daemon_snapshot.json read returns overall_status + per-listener + uptime + boot_at + recent errors.
  2. The 5-min cron cadence is acceptable for a hero panel (operators don't need sub-second listener health; for sub-minute they can hit /api/daemon/status directly).
  3. The wire shape is complete enough today for the hero — even with v1's "unknown" per-listener last_event_at / consecutive_errors, CC#2 can render listener up/down badges + an aggregate health pill from overall_status. The two "unknown" fields can be hidden behind a === "unknown" check until v2 wires them.
  4. Stop-gap for v2 enrichment: when a future bucket adds per-listener event/error tracking (e.g., via a ListenerHealthRegistry that listeners write to), the snapshot's per-listener dict will populate the same keys without breaking consumers — strictly additive at the value level ("unknown" → real value).

Test plan

  • tests/kora_cli/snapshot/test_daemon_health.py — 22 new tests (schema bump, overall_status thresholds, coordinator wired / unavailable / raises, boot_at populated / unknown, audit-tail counts: failure seams, notification status, time-window exclusion, missing file)
  • tests/kora_cli/snapshot/test_state_snapshot.py — top-level keys + schema_version updated; 47 existing tests pass
  • Regression on snapshot + daemon + audit + telemetry: 182 passed
  • ruff check clean on changed files

🤖 Generated with Claude Code

…/errors (schema v4)

* Bump snapshot schema_version 3 → 4.
* Add `snapshot.daemon_health` section — Kora's own runtime health,
  distinct from `service_health` (external SaaS dependencies):
  - `overall_status` — healthy / degraded / unhealthy / unknown,
    derived from per-listener `down` count + recent error count.
  - `boot_at` — ISO 8601 UTC wall-clock stamp at startup completion;
    populated from new `DaemonCoordinator.get_boot_at()` accessor.
    `"unknown"` while booting or when daemon isn't running.
  - `uptime_seconds` — reuses existing `get_status()` monotonic
    uptime field.
  - `listeners` — per-listener `{status, last_event_at,
    consecutive_errors}`. `status` derived from coordinator's
    `started` flag; `last_event_at` + `consecutive_errors` are
    `"unknown"` in v1 (no listener exposes those today; per-
    listener event tracking is a follow-on bucket).
  - `recent_error_count_5min` — tails `kora_audit_log.jsonl` for
    failure-shaped seams (`webhook.dead_letter`,
    `slack_dm.reply_failed`, `notification.dispatched` with
    `status="failed"`). Reuses the existing
    `read_audit_entries(since=...)` reader. Fail-soft: missing
    file → 0.
* Coordinator gets a wall-clock `_startup_completed_wall_at`
  stamped in lockstep with the existing monotonic
  `_startup_completed_at`, exposed via `get_boot_at()`. No
  existing accessor returned wall-clock boot time; monotonic
  alone can't serialize to an ISO timestamp.
* No per-listener files modified — listener-level event tracking is
  out of scope for v1 (spec §4 "describe + adapt"; 14 listeners
  found vs 9 in spec — modifying 14 unrelated files for v1
  would be churn). The shape is forward-compatible: future
  buckets fill in `last_event_at` / `consecutive_errors` per
  listener without breaking consumers.
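
For illustration, the lockstep stamping described above might look like the sketch below. The CoordinatorClock class, mark_startup_complete, and snapshot_boot_fields are hypothetical stand-ins; only the two attribute names and get_boot_at() come from this commit, and their real signatures may differ.

```python
import time
from datetime import datetime, timezone
from typing import Any, Dict, Optional


class CoordinatorClock:
    """Hypothetical stand-in for the coordinator's boot-time bookkeeping."""

    def __init__(self) -> None:
        self._startup_completed_at: Optional[float] = None  # monotonic
        self._startup_completed_wall_at: Optional[datetime] = None  # wall clock

    def mark_startup_complete(self) -> None:
        # Stamp both clocks together: monotonic for uptime arithmetic,
        # wall clock because a monotonic value has no calendar meaning.
        self._startup_completed_at = time.monotonic()
        self._startup_completed_wall_at = datetime.now(timezone.utc)

    def get_boot_at(self) -> Optional[str]:
        if self._startup_completed_wall_at is None:
            return None  # still booting
        return self._startup_completed_wall_at.strftime("%Y-%m-%dT%H:%M:%SZ")

    def uptime_seconds(self) -> Optional[float]:
        if self._startup_completed_at is None:
            return None
        return time.monotonic() - self._startup_completed_at


def snapshot_boot_fields(clock: CoordinatorClock) -> Dict[str, Any]:
    """Render the two snapshot fields, using "unknown" while booting."""
    boot_at = clock.get_boot_at()
    uptime = clock.uptime_seconds()
    return {
        "boot_at": boot_at if boot_at is not None else "unknown",
        "uptime_seconds": uptime if uptime is not None else "unknown",
    }
```

Keeping uptime on the monotonic clock means it is unaffected by wall-clock adjustments, while the wall-clock stamp is only ever used for display.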

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rafe-walker rafe-walker merged commit b9d5e92 into feature/phase2-upgrades May 24, 2026
@rafe-walker rafe-walker deleted the feat/kora-KR-SNAPSHOT-DAEMON-HEALTH branch May 24, 2026 03:13
rafe-walker added a commit that referenced this pull request May 24, 2026
…n holdouts (#174)

4-of-4 dashboard hero fields snapshot-driven on warm cache. Verified at DashboardPage.tsx:1270-1345 — loadInitial does NOT call api.getOperationalState / getCurrentAlerts / getCostState / getHealthRollup with fresh snapshot.

PR #162 anti-projection tests FLIPPED (not deleted): test_dashboard_does_not_project_*_from_snapshot → test_dashboard_projects_*_from_snapshot with null-return + per-field fallback pins. New pins: badge surfaces N-of-M hero count + DASHBOARD_HERO_FIELD_COUNT===4 literal regression guard.

K-DG drift caught + fixed: SnapshotResponse TS type was stale — missing cost_ladder.spent_to_date_usd + credit_pool_usd (#169) + the entire daemon_health section (#170). Added; required before projection helpers could compile.

Per-card retry buttons + forceFullLiveRefresh still fan out by design (operator-triggered force paths). 19/19 tests pass, tsc + vite build clean.
rafe-walker pushed a commit that referenced this pull request May 24, 2026
…tion + revert

Build-list follow-on to PR #167 (phrasebook read-only viewer).
Adds the operator-edit story so the phrasebook can be modified
from the cockpit instead of YAML by hand. Also sets the UX
foundation that the eventual promotion-review panel will reuse
(KR-PROMOTE-PHRASEBOOK proposals are pending edits to this same
surface).

Backend (kora_cli/short_circuit/phrasebook_editor.py — new)
==========================================================

Module hosts everything the read-only viewer didn't need:

  * validate_entries(entries) → [EntryValidationError, ...]
    8 checks (in order; later skipped for an entry that failed
    earlier to avoid noise): payload is list, length cap (200
    entries), required-fields-non-empty, length caps on each
    field, regex compiles with re.IGNORECASE, catastrophic-
    backtracking guard (catches (a+)+, (.*)*, etc), snapshot
    placeholder paths resolve to known scalars, no duplicate
    (pattern, category) tuples.

  * SNAPSHOT_SCALAR_PATHS — static frozen-set of known snapshot
    v4 scalar paths the operator can reference in
    {snapshot.X.Y} placeholders. Drift-guarded by
    test_static_schema_matches_snapshot_collectors which greps
    each leaf key against the snapshot collector source. Only
    scalars (no dynamic-key dicts like listeners.X or
    cost_telemetry.X — those str() to Python repr and would
    render garbage in operator-facing DMs).

  * write_backup_for / rotate_backups / list_backups —
    timestamped backups under ${KORA_HOME}/phrasebook/backups/.
    KORA_PHRASEBOOK_BACKUP_COUNT env var tunes rotation
    (default 10; clamped to [1, 1000]). Filenames sort
    chronologically as plain strings (ISO-Z format).

  * write_phrasebook(entries) — atomic write via
    utils.atomic_replace (same pattern as snapshot writer).
    Deterministic field order for readable YAML diffs.

  * revert_phrasebook(filename=...) — revert to a specific
    backup OR (no filename) the most-recent OR (no backups)
    remove the override entirely. Path-traversal defense
    rejects any filename containing /, \, .., or not matching
    the slack_dm.*.yml shape.
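
A rough sketch of two of the guards listed above: the revert filename check and the backup-count clamp. Both function names are hypothetical. Reading "slack_dm.*.yml" as a glob and falling back to the default on a non-integer env value are assumptions, not confirmed behaviour of phrasebook_editor.py.

```python
from fnmatch import fnmatchcase
from typing import Mapping

DEFAULT_BACKUP_COUNT = 10


def is_safe_backup_filename(filename: str) -> bool:
    """Reject path-traversal attempts before touching the filesystem."""
    # Separators and parent references are rejected outright.
    if "/" in filename or "\\" in filename or ".." in filename:
        return False
    # Assumption: the "slack_dm.*.yml shape" is treated as a glob pattern.
    return fnmatchcase(filename, "slack_dm.*.yml")


def backup_count(env: Mapping[str, str]) -> int:
    """Read KORA_PHRASEBOOK_BACKUP_COUNT, clamped to [1, 1000]."""
    raw = env.get("KORA_PHRASEBOOK_BACKUP_COUNT")
    if raw is None:
        return DEFAULT_BACKUP_COUNT
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_BACKUP_COUNT  # assumed fallback for unparsable values
    return max(1, min(1000, value))
```

Checking the forbidden substrings before the shape match keeps the rejection independent of how the pattern matcher treats separators.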

Backend (kora_cli/web_server.py — 3 new endpoints)
==================================================

  PUT  /api/phrasebook/slack_dm           — validate + back up +
                                            atomic-write + audit
  POST /api/phrasebook/slack_dm/revert    — revert to backup
  GET  /api/phrasebook/slack_dm/backups   — list backups

PUT contract:
  * Validation fails → 422 with structured per-entry errors;
    NO write; previous override preserved.
  * Validation passes → backup current (if exists) → atomic
    write → rotate backups → audit row → 200 with echoed
    entries + backup_filename + rotated_count.
  * Write itself fails → 500; backup preserved so operator
    can recover.
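
The PUT contract above can be sketched as a single function with its collaborators injected. This is illustrative only: handle_put and its callable parameters are hypothetical, and the "write_failed" error string is an assumption (only "validation_failed" appears in this commit).

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def handle_put(
    entries: List[Dict[str, Any]],
    *,
    validate: Callable[[List[Dict[str, Any]]], List[Any]],
    backup_current: Callable[[], Optional[str]],
    write: Callable[[List[Dict[str, Any]]], None],
    rotate: Callable[[], int],
    audit: Callable[[Dict[str, Any]], None],
) -> Tuple[int, Dict[str, Any]]:
    """Return (http_status, body) following the PUT contract."""
    errors = validate(entries)
    if errors:
        # No write happens; the previous override is left untouched.
        return 422, {"error": "validation_failed", "errors": errors}

    backup_filename = backup_current()  # None when no override exists yet
    try:
        write(entries)
    except OSError:
        # The backup taken above is kept so the operator can recover.
        return 500, {"error": "write_failed", "backup_filename": backup_filename}

    rotated = rotate()
    audit({
        "action": "put",
        "backup_filename": backup_filename,
        "rotated_backup_count": rotated,
    })
    return 200, {
        "entries": entries,
        "backup_filename": backup_filename,
        "rotated_count": rotated,
    }
```

Rotation and the audit row come after the write, so a failed write never prunes backups or records a change that did not happen.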

Audit seam (new): phrasebook.updated
====================================

Extended SeamName Literal in kora_cli/audit/jsonl_sink.py.
Each successful PUT (or revert) emits one entry with:
  * actor          — "operator" (cockpit-driven; future
                      "kora_proposal_approved" from the
                      promotion-loop bucket reuses this shape)
  * action         — "put" | "revert"
  * entry_count_before / entry_count_after
  * backup_filename (when applicable)
  * rotated_backup_count (when applicable)
  * reverted_to (when action=revert)

Drives the future KR-PROMOTION-REVIEW-PANEL via the existing
audit-panel infrastructure (KR-AUDIT-PANEL-ENDPOINTS PR #155).

Frontend
========

  * web/src/pages/PhrasebookEditor.tsx (new) — hosts the
    editor sub-components so PhrasebookPage stays readable:
    EntryEditorRow (4 inline editable fields + per-field
    validation errors), EditModeControls (Save / Cancel / Add),
    BackupsDialog (modal with newest-first list + per-row
    revert + confirm), ClientSidePreview (mirrors
    dm_phrasebook.match + render_reply in TS so operator can
    preview in-progress edits without saving).

  * api wrappers + types: putSlackDmPhrasebook /
    revertSlackDmPhrasebook / getSlackDmPhrasebookBackups +
    PhrasebookEntryWrite / PhrasebookPutResponse /
    PhrasebookValidationErrorEntry/Body /
    PhrasebookRevertResponse / PhrasebookBackupItem /
    PhrasebookBackupsResponse.

  * PhrasebookPage extended with edit-mode toggle. View-mode
    surface (read-only table + live tester) is unchanged for
    operators who just want to inspect. Edit-mode swaps to
    editor rows + client-side preview + Save/Cancel/Add.
    Backups button in view-mode opens the revert dialog.

  * 422 validation-error parsing: Save handler unmarshals
    fetchJSON's "STATUS: BODY" Error message; on 422 with
    error="validation_failed" the body's per-entry errors are
    routed to the editor for inline rendering.

Tests
=====

  Backend (tests/kora_cli/test_phrasebook_editor.py — 44 tests):

    Validation (14): valid entries, non-list, missing field,
      empty whitespace, all 4 length caps, entries count cap,
      invalid regex, catastrophic-backtracking parametrized
      across 4 pathological patterns, unknown vs known snapshot
      path, duplicate (pattern, category), schema drift guard
      against state_snapshot.py.
    Backups (5): no-override returns None, copies content +
      timestamps filename, rotation keeps N most recent, env
      override + clamping parametrized, list returns newest-
      first with entry_count.
    Write+revert (5): round-trip load, revert by name, revert
      most-recent, revert with no backups removes override,
      path-traversal rejection parametrized across 5 attacks.
    Endpoints (7): PUT valid + audit, PUT invalid + preserve,
      PUT no-existing-override, POST revert + audit, POST
      revert invalid filename → 400, POST revert missing →
      404, GET backups newest-first.
    Audit (1): SeamName Literal includes phrasebook.updated.
    Integration (1): round-trip PUT then revert restores seed.

  FE source-pins (tests/kora_cli/test_phrasebook_editor_fe_pins.py
  — 17 tests):

    api wrappers (3), TS types declared + shape pinned (2),
    editor file exists + exports (1), page wiring (3), edit
    button hidden in edit-mode (1), 422 body parsing branch
    present (1), ClientSidePreview mirrors backend regex flag
    + "unknown" sentinel + null-snapshot fall-through (3),
    SnapshotResponse type covers daemon_health (1).

  All 76 phrasebook tests pass (44 BE editor + 17 FE pins +
  15 existing PR #167 read-side tests).

  tsc -b clean. vite build clean.

Screenshots
===========

  web/docs/phrasebook-editor/edit_mode.png — edit-mode UI with
    client-side preview rendering a $0 reply
  web/docs/phrasebook-editor/validation.png — 422 response with
    4 inline per-field errors (bad regex, unknown snapshot path,
    catastrophic backtracking, missing description) + top-level
    duplicate error
  web/docs/phrasebook-editor/revert.png — backups modal with
    newest-first list, per-row Revert confirm flow, greyed-out
    corrupt backup

Design choices noted (no STOP-ASKs triggered)
=============================================

  * Snapshot field-path validation uses a STATIC allow-list
    (SNAPSHOT_SCALAR_PATHS), NOT a live snapshot walk. Spec §4
    flagged this as a possible STOP-ASK; the static-list
    approach is what the spec offered as the alternative
    ("use SnapshotResponse static schema") and avoids the
    dynamic-snapshot-during-warm-up problem where freshly-
    booted daemons would mark canonical paths as invalid.
    Drift guard test pins each scalar leaf against the
    snapshot collector source.

  * Audit seam SeamName extension follows the
    probe.wake_requested precedent — extend Literal with a
    new value + add the comment explaining the future
    actor extension. No consumer drift.

  * Operator-defined category values: free-form for v1
    (the runtime doesn't reserve any category names yet;
    promotion-loop bucket can add reserved-prefix
    discipline when its UX lands).
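
As a sketch of the static allow-list approach from the first design choice above: the set below is an illustrative subset built from the daemon_health scalars in snapshot v4, not the real contents of SNAPSHOT_SCALAR_PATHS, and both the placeholder regex and unknown_placeholder_paths are hypothetical.

```python
import re
from typing import List

# Illustrative subset only; the real allow-list lives in phrasebook_editor.py.
SNAPSHOT_SCALAR_PATHS = frozenset({
    "daemon_health.overall_status",
    "daemon_health.boot_at",
    "daemon_health.uptime_seconds",
    "daemon_health.recent_error_count_5min",
})

# Assumed placeholder syntax: {snapshot.X.Y} with dotted identifier paths.
_PLACEHOLDER = re.compile(r"\{snapshot\.([A-Za-z0-9_.]+)\}")


def unknown_placeholder_paths(template: str) -> List[str]:
    """Return placeholder paths that are not in the static allow-list."""
    return [
        path
        for path in _PLACEHOLDER.findall(template)
        if path not in SNAPSHOT_SCALAR_PATHS
    ]
```

Because the check never reads a live snapshot, a freshly booted daemon with "unknown" values validates the same templates as a warm one, and dynamic-key paths such as a per-listener dict are rejected simply by being absent from the set.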

Refs
====

  * rafe-walker/kora-docs
    17_cc_bucket_prompts/KR-FE-PHRASEBOOK-EDITOR-AND-CRUD_write_path_with_validation.md
  * PR #160 — phrasebook + dm_phrasebook module (read side)
  * PR #167 — KR-FE-PHRASEBOOK-VIEWER (read-only viewer this extends)
  * PR #170 — snapshot v4 daemon_health (referenced in
    SNAPSHOT_SCALAR_PATHS allow-list)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>