This repository was archived by the owner on May 26, 2026. It is now read-only.

feat(kora): KR-CHEAP-PRE-WARMED-SNAPSHOT — daemon state every 5 min at zero LLM cost#157

Merged
rafe-walker merged 1 commit into feature/phase2-upgrades from feat/kora-KR-CHEAP-PRE-WARMED-SNAPSHOT
May 24, 2026
@rafe-walker

Summary

Per Council R3 Lock R3-4 item #3. Pre-warmed daemon state snapshot computed every 5 min by a periodic task; cockpit + reasoning engine read from it at $0 LLM cost for routine status queries ("burn this week?", "any alerts?", "what's open?").

Foundational for probe-audit work + reasoning-engine routing-layer short-circuit; consumer wiring is separate bucket (KR-SNAPSHOT-INTO-ROUTING).

Bucket spec: `17_cc_bucket_prompts/KR-CHEAP-PRE-WARMED-SNAPSHOT_cron_to_engine_at_zero_llm_cost.md`

Populated vs degraded fields (v1)

| Section | Field | Disposition |
|---|---|---|
| operational_state | `primary` | ✅ populated from `get_holder().current.primary_state.value` |
| operational_state | `paused` | ✅ derived |
| operational_state | `pause_reason` | ✅ populated (null when empty set) |
| alerts | `active_count` | ✅ from `compute_active_alerts()` |
| alerts | `by_severity` | ✅ rollup |
| alerts | `by_category` | ✅ rollup |
| cost_ladder | `current_tier` | ✅ `get_cost_holder().active_rung().name` (e.g., "WARN_75") |
| cost_ladder | `monthly_budget_pct_used` | ✅ `current_pct_used() * 100` |
| cost_ladder | `model_default` | ⚠️ `"unknown"`: dynamic per-call downshift; no holder field |
| service_health | `{vercel,sentry,doppler,supabase,fly}` | ✅ from `current_service_snapshots()` (per-probe degrade to "unknown" if no observation yet) |
| tasks | `open_count` | ⏸️ `"unknown"`: deferred per spec §4 |
| tasks | `in_progress_count` | ⏸️ `"unknown"`: deferred per spec §4 |

Tasks are deferred because calling the Sea_Tickets substrate MCP from a 5-min periodic task breaks the $0-LLM-cost premise (each substrate read isn't free). Spec §4 STOP-ASK guidance applied: the snapshot field shape stays stable with "unknown" placeholders so consumers can branch on the placeholder value; a follow-on bucket can wire a cached substrate-read accessor.

`model_default` degraded because the active model at any moment depends on the cost-ladder downshift in `agent.cost_downshift` (per-call routing). Surface a stable "unknown" rather than mislead. A follow-on bucket can add a per-rung default-model accessor if operators want it.
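
A consumer can treat the `"unknown"` placeholder as "no data" and branch on it. The following is a minimal sketch, not code from this PR: the helper name `known_or_none` and the example dict are hypothetical, and only the field names and the `"unknown"` placeholder come from the table above.

```python
from typing import Any, Optional

UNKNOWN = "unknown"  # placeholder value the PR uses for degraded/deferred fields


def known_or_none(snapshot: dict[str, Any], section: str, field: str) -> Optional[Any]:
    """Return the field value, or None when it is missing or degraded.

    Hypothetical helper: lets callers write `if value is None: fall back`.
    """
    value = snapshot.get(section, {}).get(field, UNKNOWN)
    return None if value == UNKNOWN else value


# Hypothetical v1-shaped fragment, for illustration only.
example = {
    "cost_ladder": {"current_tier": "WARN_75", "model_default": "unknown"},
    "tasks": {"open_count": "unknown", "in_progress_count": "unknown"},
}

tier = known_or_none(example, "cost_ladder", "current_tier")
open_tasks = known_or_none(example, "tasks", "open_count")
```

Here `tier` carries a real value while `open_tasks` is `None`, so a routing layer could answer the cost question from the snapshot and fall back to a live read for tasks.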

Surface

| Layer | LOC | Notes |
|---|---|---|
| `kora_cli/snapshot/state_snapshot.py` (NEW) | 380 | per-section collectors + compute/write/read/freshness/cycle |
| `kora_cli/snapshot/__init__.py` (NEW) | 65 | public surface + `get_snapshot_for_routing` |
| `kora_cli/listeners/snapshot_listener.py` (NEW) | 130 | daemon listener + register_periodic_task `snapshot.compute` |
| `kora_cli/listeners/__init__.py` | +8 | wire-in last so all upstream holders register first |
| `kora_cli/web_server.py` | +33 | `GET /api/snapshot` endpoint |
| Tests | 51 new | 36 state_snapshot + 8 listener + 7 endpoint |

Design choice: heartbeat scheduler vs cron/jobs.py

Spec §2(b) allows either. Picked the heartbeat-scheduler path (matches MCP-CONSUMPTION health-check, alert-notifier, email IMAP poll, heartbeat probes). `cron/jobs.py` is the agent-driven cron — spawns external worker processes per fire. Overkill + expensive for a pure-Python state projection. Documented in listener module docstring.

Cadence: `KORA_SNAPSHOT_INTERVAL_SEC` (default 300s = 5 min, matching the `*/5 * * * *` cron pattern from spec).
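
As a rough illustration of the cadence override, the sketch below resolves the interval from the environment. The function name `resolve_interval_sec` is hypothetical, and the fallback to the default on malformed or non-positive values is an assumption; the PR only states the variable name and the 300s default.

```python
import os
from typing import Mapping, Optional

DEFAULT_INTERVAL_SEC = 300  # 5 minutes, the default stated in the PR


def resolve_interval_sec(env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve the snapshot cadence from KORA_SNAPSHOT_INTERVAL_SEC.

    Assumption: unparsable or non-positive values fall back to the default.
    """
    env = os.environ if env is None else env
    raw = env.get("KORA_SNAPSHOT_INTERVAL_SEC")
    if raw is None:
        return DEFAULT_INTERVAL_SEC
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_INTERVAL_SEC
    return value if value > 0 else DEFAULT_INTERVAL_SEC
```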

Read-only contract preserved

This module is a read-only consumer of every source holder. No mutation of `agent/operational_state_holder`, `agent/cost_state_holder`, `kora_cli/heartbeat_probes`, or alerts aggregator. The snapshot is a projection, not a mirror — consumers wanting authoritative state still read the source-of-truth accessors.

Fail-soft contract

Each per-source collector is independently wrapped in try/except. One source failure degrades only that section. The top-level `compute_snapshot()` NEVER raises — proven by `test_compute_snapshot_full_degrade_does_not_raise` (every accessor blown up simultaneously; snapshot still computes with every field degraded).

`run_snapshot_cycle` (the periodic-task entry) catches compute + write failures + logs, so the heartbeat scheduler keeps ticking even when snapshot generation fails.

Atomic write

Uses `utils.atomic_replace` (same pattern `cron/jobs.py` uses for `jobs.json`). Tempfile written to the same parent dir, then renamed. No partial-write window.
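
For readers unfamiliar with the pattern, here is a standard-library sketch of a same-directory tempfile followed by a rename. It does not reproduce `utils.atomic_replace`, whose signature is not shown in this PR; `write_json_atomically` is a hypothetical stand-in built on `tempfile.mkstemp` and `os.replace`.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write JSON so readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same parent dir keeps the rename on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # replaces the destination in one step
    except BaseException:
        # Clean up the tempfile if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```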

CC#2 #137 fixture-isolation discipline applied

`/api/snapshot` endpoint tests patch `get_kora_home` in 3 namespaces (kora_constants + kora_cli.config + kora_cli.web_server) so parallel test workers can't see each other's snapshot files.

Test plan

  • 51 new tests pass (36 state_snapshot + 8 listener + 7 endpoint)
  • 437/437 cross-bucket regression (snapshot + alerts + all test_listeners)
  • Ruff clean
  • Listener registered in `LISTENER_REGISTRY` (verifiable via `list_jobs()`-equivalent — `PERIODIC_TASK_REGISTRY` contains `snapshot.compute`)
  • No mutation of `agent/operational_state_holder` (grep + design review)

Cascade

Standalone PR. Follow-on bucket dispatch suggestions:

  • KR-SNAPSHOT-INTO-ROUTING — wire `get_snapshot_for_routing()` into the reasoning engine's status-query short-circuit
  • KR-SNAPSHOT-TASKS — substrate Sea_Tickets read at a slower cadence (e.g., 30 min) writes to a separate snapshot section
  • KR-SNAPSHOT-MODEL-DEFAULT — per-rung default-model accessor on `cost_state_holder` so `model_default` populates real data

🤖 Generated with Claude Code

…t zero LLM cost

Per Council R3 Lock R3-4 item #3. Enables routine status queries
("burn this week?", "any alerts?", "what's open?") to be answered
at $0 LLM cost — engine reads the pre-computed snapshot instead
of tool-calling. Foundational infrastructure for probe-audit work
+ reasoning-engine routing-layer short-circuit (separate bucket
KR-SNAPSHOT-INTO-ROUTING wires the consumer side).

# New module: kora_cli/snapshot/

  * ``state_snapshot.py`` — pure projection from live read
    accessors (operational_state_holder + cost_state_holder +
    alerts aggregator + heartbeat probe snapshots). Per-source
    collectors are independently fail-soft: a single source
    failure degrades only that section to ``"unknown"`` (or null
    where shape requires).
  * ``__init__.py`` — public surface (compute / write / read /
    is_fresh / snapshot_path / run_snapshot_cycle / SCHEMA_VERSION)
    + ``get_snapshot_for_routing()`` convenience the future
    reasoning-engine routing-layer bucket consumes.

Snapshot file: ``${KORA_HOME}/cache/daemon_snapshot.json`` (atomic
write via existing ``utils.atomic_replace``, the same pattern
cron/jobs.py uses for jobs.json).

# Schema v1 — populated vs degraded

| Section | Field | Source | v1 disposition |
|---|---|---|---|
| operational_state | primary | get_holder().current.primary_state | ✅ populated |
| operational_state | paused | derived from primary | ✅ populated |
| operational_state | pause_reason | degradation_reasons[0].value when paused | ✅ populated (null when empty set) |
| alerts | active_count | len(compute_active_alerts()) | ✅ populated |
| alerts | by_severity | rollup of alerts | ✅ populated |
| alerts | by_category | rollup of alerts | ✅ populated |
| cost_ladder | current_tier | get_cost_holder().active_rung().name | ✅ populated |
| cost_ladder | monthly_budget_pct_used | get_cost_holder().current_pct_used() * 100 | ✅ populated |
| cost_ladder | model_default | dynamic per-call downshift (no holder field) | ⚠️ "unknown" v1 |
| service_health | {vercel,sentry,doppler,supabase,fly} | current_service_snapshots()[name].status | ✅ populated (per-probe degrade to "unknown" if absent) |
| tasks | open_count, in_progress_count | substrate Sea_Tickets read | ⏸️ "unknown" v1 (deferred per spec §4 — MCP call at 5-min cadence flagged ASK; follow-on bucket can wire cached substrate read) |
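
The three alerts fields in the table could be produced by a rollup along these lines. This is a sketch under assumptions: the return shape of `compute_active_alerts()` is not shown here, so the example assumes each alert is a dict with `severity` and `category` keys, and `rollup_alerts` is a hypothetical name.

```python
from collections import Counter
from typing import Any, Dict, List


def rollup_alerts(alerts: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarise active alerts into count plus per-severity/category tallies.

    Assumes each alert dict carries "severity" and "category" keys.
    """
    return {
        "active_count": len(alerts),
        "by_severity": dict(Counter(a["severity"] for a in alerts)),
        "by_category": dict(Counter(a["category"] for a in alerts)),
    }


# Hypothetical alert records, for illustration only.
sample = [
    {"severity": "high", "category": "cost"},
    {"severity": "low", "category": "cost"},
    {"severity": "high", "category": "service"},
]
summary = rollup_alerts(sample)
```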

# Listener wiring

New ``kora_cli/listeners/snapshot_listener.py`` registers via
``register_daemon_listener("snapshot", factory)`` + the periodic
task ``snapshot.compute`` via ``register_periodic_task`` from the
heartbeat scheduler. Cadence default 300s (5 min);
``KORA_SNAPSHOT_INTERVAL_SEC`` env override.

Spec §2(b) says "extend cron/jobs.py OR new
kora_cli/snapshot/__init__.py"; picked the heartbeat-scheduler path
(matches what MCP-CONSUMPTION health-check, alert-notifier, email IMAP
poll, heartbeat probes all do: cheap in-process compute). cron/jobs.py
is the agent-driven cron for external worker processes; overkill for a
pure-Python state projection.

# Web endpoint

``GET /api/snapshot`` returns the snapshot dict verbatim when
fresh on disk; returns ``{"error": "no_snapshot", "stale": true}``
when missing or stale (>10 min). Cockpit + future routing layer
consume this without paying per-source fan-out.
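
The freshness gate behind the endpoint can be sketched like this. It is not the handler from `web_server.py`: `read_snapshot_or_error` is a hypothetical name, and judging staleness by file mtime is an assumption (the real module may instead use a timestamp inside the payload). The 10-minute threshold and the error body come from the text above.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, Optional

STALE_AFTER_SEC = 600  # 10 minutes, per the commit message

NO_SNAPSHOT = {"error": "no_snapshot", "stale": True}


def read_snapshot_or_error(path: Path, now: Optional[float] = None) -> Dict[str, Any]:
    """Return the snapshot dict when fresh, else the no_snapshot error body."""
    now = time.time() if now is None else now
    try:
        age = now - path.stat().st_mtime  # assumption: mtime marks compute time
        if age > STALE_AFTER_SEC:
            return dict(NO_SNAPSHOT)
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON all degrade the same way.
        return dict(NO_SNAPSHOT)
```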

# Read-only contract preserved

This module is a read-only consumer of every source holder. No
mutation of agent/operational_state_holder, agent/cost_state_holder,
kora_cli/heartbeat_probes, or alerts aggregator. The snapshot is a
projection, not a mirror: consumers wanting authoritative state still
read the source-of-truth accessors; the snapshot is the cheap path for
routine status queries.

# Tests

51 new tests pass:
  * 36 state_snapshot (shape + per-section degradation +
    full-degrade resilience + atomic write + read freshness gate +
    is_snapshot_fresh boundary tests + run_snapshot_cycle end-to-end +
    fail-soft on compute/write failure + get_snapshot_for_routing
    convenience)
  * 8 listener (registration in LISTENER_REGISTRY + periodic-task
    registration + cadence env resolution + lifecycle log lines)
  * 7 web endpoint (3-namespace get_kora_home fixture-isolation
    per CC#2 #137; fresh / missing / stale paths + shape pin)

437/437 cross-bucket regression (snapshot + alerts + all
test_listeners). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rafe-walker rafe-walker merged commit 7489b68 into feature/phase2-upgrades May 24, 2026
@rafe-walker rafe-walker deleted the feat/kora-KR-CHEAP-PRE-WARMED-SNAPSHOT branch May 24, 2026 01:23
rafe-walker added a commit that referenced this pull request May 24, 2026
Lock R3-8 sub-cut (c) implementation. 34 panels instrumented (8 panels + 26 pages).

Backend: POST /api/panel_view → ${KORA_HOME}/panel_views.jsonl (Path B chosen — separate file from kora_audit_log.jsonl to preserve audit log's forensic semantics per CC#2's K-DG sweep).

Hook: web/src/hooks/usePanelView.ts — fire-and-forget POST on mount; silent failure (instrumentation must never break UX).

18/18 endpoint+pin tests + 210/210 regression. tsc -b + vite build clean.

Rebased onto current feature/phase2-upgrades (post #157 snapshot + #158 caching) to resolve adjacent-endpoint-addition conflict in kora_cli/web_server.py.
rafe-walker added a commit that referenced this pull request May 24, 2026
…it_pool_usd (schema v3) (#169)

Snapshot schema_version 2→3. Adds spent_to_date_usd + credit_pool_usd to cost_ladder section. KORA_CREDIT_POOL_USD env override (default $200 per reference-anthropic-sdk-billing-split); malformed/non-positive values warn + fall back.
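
The "warn + fall back" behaviour for `KORA_CREDIT_POOL_USD` might look roughly like the sketch below. The function name `resolve_credit_pool_usd` and the logger name are hypothetical; only the variable name, the $200 default, and the malformed/non-positive fallback rule come from the commit message. Treating `nan` and `inf` as malformed is an added assumption.

```python
import logging
import math
import os
from typing import Mapping, Optional

logger = logging.getLogger("snapshot_sketch.cost")

DEFAULT_CREDIT_POOL_USD = 200.0  # default stated in the commit message


def resolve_credit_pool_usd(env: Optional[Mapping[str, str]] = None) -> float:
    """Resolve the credit pool size, warning and falling back on bad input."""
    env = os.environ if env is None else env
    raw = env.get("KORA_CREDIT_POOL_USD")
    if raw is None:
        return DEFAULT_CREDIT_POOL_USD
    try:
        value = float(raw)
    except ValueError:
        logger.warning("KORA_CREDIT_POOL_USD=%r is malformed; using default", raw)
        return DEFAULT_CREDIT_POOL_USD
    # Assumption: non-finite values count as malformed alongside non-positive ones.
    if not math.isfinite(value) or value <= 0:
        logger.warning("KORA_CREDIT_POOL_USD=%r is not positive; using default", raw)
        return DEFAULT_CREDIT_POOL_USD
    return value
```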

Bonus: model_default now resolves via KR-HAIKU-ROUTER (#165) DEFAULT_HAIKU_MODEL — router-independent of cost holder, so PR #157's 'unknown' placeholder for that field is fully retired.

Unblocks CC#2 follow-on (CostCardBody shift to snapshot — sub-ms read, decoupled from cost-holder boot windows, lag bounded to 5-min cron cadence). /api/cost-telemetry retained as live source for pre-decision reads.

47 snapshot tests + 159-test cross-bucket regression + ruff clean.