This repository was archived by the owner on May 26, 2026. It is now read-only.
feat(kora): KR-CHEAP-PRE-WARMED-SNAPSHOT - daemon state every 5 min at zero LLM cost #157
Merged
rafe-walker merged 1 commit on May 24, 2026
…t zero LLM cost

Per Council R3 Lock R3-4 item #3. Enables routine status queries ("burn this week?", "any alerts?", "what's open?") to be answered at $0 LLM cost: the engine reads the pre-computed snapshot instead of tool-calling. Foundational infrastructure for probe-audit work + reasoning-engine routing-layer short-circuit (separate bucket KR-SNAPSHOT-INTO-ROUTING wires the consumer side).

# New module: kora_cli/snapshot/

* ``state_snapshot.py``: pure projection from live read accessors (operational_state_holder + cost_state_holder + alerts aggregator + heartbeat probe snapshots). Per-source collectors are independently fail-soft: a single source failure degrades only that section to ``"unknown"`` (or null where shape requires).
* ``__init__.py``: public surface (compute / write / read / is_fresh / snapshot_path / run_snapshot_cycle / SCHEMA_VERSION) + ``get_snapshot_for_routing()`` convenience the future reasoning-engine routing-layer bucket consumes.

Snapshot file: ``${KORA_HOME}/cache/daemon_snapshot.json`` (atomic write via existing ``utils.atomic_replace``, the same pattern cron/jobs.py uses for jobs.json).
# Schema v1: populated vs degraded

| Section | Field | Source | v1 disposition |
|---|---|---|---|
| operational_state | primary | get_holder().current.primary_state | ✅ populated |
| operational_state | paused | derived from primary | ✅ populated |
| operational_state | pause_reason | degradation_reasons[0].value when paused | ✅ populated (null when empty set) |
| alerts | active_count | len(compute_active_alerts()) | ✅ populated |
| alerts | by_severity | rollup of alerts | ✅ populated |
| alerts | by_category | rollup of alerts | ✅ populated |
| cost_ladder | current_tier | get_cost_holder().active_rung().name | ✅ populated |
| cost_ladder | monthly_budget_pct_used | get_cost_holder().current_pct_used() * 100 | ✅ populated |
| cost_ladder | model_default | dynamic per-call downshift (no holder field) | ⚠️ "unknown" v1 |
| service_health | {vercel,sentry,doppler,supabase,fly} | current_service_snapshots()[name].status | ✅ populated (per-probe degrade to "unknown" if absent) |
| tasks | open_count, in_progress_count | substrate Sea_Tickets read | ⏸️ "unknown" v1 (deferred per spec §4: MCP call at 5-min cadence flagged ASK; follow-on bucket can wire cached substrate read) |

# Listener wiring

New ``kora_cli/listeners/snapshot_listener.py`` registers via ``register_daemon_listener("snapshot", factory)`` + the periodic task ``snapshot.compute`` via ``register_periodic_task`` from the heartbeat scheduler. Cadence default 300s (5 min); ``KORA_SNAPSHOT_INTERVAL_SEC`` env override.

Spec §2(b) says "extend cron/jobs.py OR new kora_cli/snapshot/__init__.py"; picked the heartbeat-scheduler path (matches what MCP-CONSUMPTION health-check, alert-notifier, email IMAP poll, heartbeat probes all do: cheap in-process compute). cron/jobs.py is the agent-driven cron for external worker processes; overkill for a pure-Python state projection.
# Web endpoint

``GET /api/snapshot`` returns the snapshot dict verbatim when fresh on disk; returns ``{"error": "no_snapshot", "stale": true}`` when missing or stale (>10 min). Cockpit + future routing layer consume this without paying per-source fan-out.

# Read-only contract preserved

This module is a read-only consumer of every source holder. No mutation of agent/operational_state_holder, agent/cost_state_holder, kora_cli/heartbeat_probes, or alerts aggregator. The snapshot is a projection, not a mirror: consumers wanting authoritative state still read the source-of-truth accessors; the snapshot is the cheap path for routine status queries.

# Tests

51 new tests pass:

* 36 state_snapshot (shape + per-section degradation + full-degrade resilience + atomic write + read freshness gate + is_snapshot_fresh boundary tests + run_snapshot_cycle end-to-end + fail-soft on compute/write failure + get_snapshot_for_routing convenience)
* 8 listener (registration in LISTENER_REGISTRY + periodic-task registration + cadence env resolution + lifecycle log lines)
* 4 web endpoint (3-namespace get_kora_home fixture-isolation per CC#2 #137; fresh / missing / stale paths + shape pin)

437/437 cross-bucket regression (snapshot + alerts + all test_listeners). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rafe-walker added a commit that referenced this pull request on May 24, 2026
Lock R3-8 sub-cut (c) implementation. 34 panels instrumented (8 panels + 26 pages).
Backend: POST /api/panel_view → ${KORA_HOME}/panel_views.jsonl (Path B chosen — separate file from kora_audit_log.jsonl to preserve audit log's forensic semantics per CC#2's K-DG sweep).
Hook: web/src/hooks/usePanelView.ts — fire-and-forget POST on mount; silent failure (instrumentation must never break UX).
18/18 endpoint+pin tests + 210/210 regression. tsc -b + vite build clean.
Rebased onto current feature/phase2-upgrades (post #157 snapshot + #158 caching) to resolve adjacent-endpoint-addition conflict in kora_cli/web_server.py.
rafe-walker added a commit that referenced this pull request on May 24, 2026
…it_pool_usd (schema v3) (#169)

Snapshot schema_version 2→3. Adds spent_to_date_usd + credit_pool_usd to cost_ladder section. KORA_CREDIT_POOL_USD env override (default $200 per reference-anthropic-sdk-billing-split); malformed/non-positive values warn + fall back.

Bonus: model_default now resolves via KR-HAIKU-ROUTER (#165) DEFAULT_HAIKU_MODEL, which is router-independent of the cost holder, so PR #157's 'unknown' placeholder for that field is fully retired.

Unblocks CC#2 follow-on (CostCardBody shift to snapshot: sub-ms read, decoupled from cost-holder boot windows, lag bounded to 5-min cron cadence). /api/cost-telemetry retained as live source for pre-decision reads.

47 snapshot tests + 159-test cross-bucket regression + ruff clean.
Summary
Per Council R3 Lock R3-4 item #3. Pre-warmed daemon state snapshot computed every 5 min by a periodic task; cockpit + reasoning engine read from it at $0 LLM cost for routine status queries ("burn this week?", "any alerts?", "what's open?").
Foundational for probe-audit work + reasoning-engine routing-layer short-circuit; consumer wiring is separate bucket (KR-SNAPSHOT-INTO-ROUTING).
Bucket spec: `17_cc_bucket_prompts/KR-CHEAP-PRE-WARMED-SNAPSHOT_cron_to_engine_at_zero_llm_cost.md`
Populated vs degraded fields (v1)
Tasks are deferred because reading the Sea_Tickets substrate over MCP from a 5-min periodic task breaks the $0-LLM-cost premise (each substrate read isn't free). Spec §4 STOP-ASK guidance applied: the snapshot field shape stays stable with "unknown" placeholders so consumers can branch on presence; a follow-on bucket can wire a cached substrate-read accessor.
`model_default` is degraded because the active model at any moment depends on the per-call cost-ladder downshift in `agent.cost_downshift`. The snapshot surfaces a stable "unknown" rather than a misleading value. A follow-on bucket can add a per-rung default-model accessor if operators want it.
Surface
Design choice: heartbeat scheduler vs cron/jobs.py
Spec §2(b) allows either. Picked the heartbeat-scheduler path (matches MCP-CONSUMPTION health-check, alert-notifier, email IMAP poll, heartbeat probes). `cron/jobs.py` is the agent-driven cron — spawns external worker processes per fire. Overkill + expensive for a pure-Python state projection. Documented in listener module docstring.
Cadence: `KORA_SNAPSHOT_INTERVAL_SEC` (default 300s = 5 min, matching the `*/5 * * * *` cron pattern from spec).
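For readers unfamiliar with the override, here is a minimal sketch of how a cadence like this could be resolved. Only the variable name `KORA_SNAPSHOT_INTERVAL_SEC` and the 300s default come from this PR; the function name and the fallback policy for malformed or non-positive values are assumptions for illustration, not the listener's actual code.

```python
import os

DEFAULT_INTERVAL_SEC = 300  # 5 minutes, the default stated in this PR


def resolve_snapshot_interval(env=None):
    """Return the snapshot cadence in seconds (hypothetical helper).

    Assumed policy: fall back to the default when the override is unset,
    not an integer, or not positive.
    """
    env = os.environ if env is None else env
    raw = env.get("KORA_SNAPSHOT_INTERVAL_SEC")
    if raw is None:
        return DEFAULT_INTERVAL_SEC
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return DEFAULT_INTERVAL_SEC
    return value if value > 0 else DEFAULT_INTERVAL_SEC
```

Passing a plain dict as `env` keeps the helper testable without touching the process environment.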
Read-only contract preserved
This module is a read-only consumer of every source holder. No mutation of `agent/operational_state_holder`, `agent/cost_state_holder`, `kora_cli/heartbeat_probes`, or alerts aggregator. The snapshot is a projection, not a mirror — consumers wanting authoritative state still read the source-of-truth accessors.
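On the consumer side, the split between the cheap snapshot path and the authoritative accessors might look roughly like the sketch below. The 10-minute staleness threshold comes from the commit message; everything else is an assumption: the function names are hypothetical, and freshness is judged here by file mtime, which may not be how `is_fresh` actually works.

```python
import json
import os
import time

MAX_AGE_SEC = 600  # 10 minutes, the staleness threshold named in the commit message


def read_fresh_snapshot(path, now=None, max_age_sec=MAX_AGE_SEC):
    """Return the parsed snapshot, or None if missing, unreadable, or stale.

    Assumption: staleness is measured from the file's mtime.
    """
    now = time.time() if now is None else now
    try:
        if now - os.path.getmtime(path) > max_age_sec:
            return None
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (OSError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return None


def status_for_query(path, read_authoritative):
    """Prefer the cheap snapshot; fall back to the source-of-truth read."""
    snapshot = read_fresh_snapshot(path)
    return snapshot if snapshot is not None else read_authoritative()
```

The reader never writes to the snapshot or to any holder, which mirrors the read-only contract on the producing side.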
Fail-soft contract
Each per-source collector is independently wrapped in try/except. One source failure degrades only that section. The top-level `compute_snapshot()` NEVER raises — proven by `test_compute_snapshot_full_degrade_does_not_raise` (every accessor blown up simultaneously; snapshot still computes with every field degraded).
`run_snapshot_cycle` (the periodic-task entry) catches compute + write failures + logs, so the heartbeat scheduler keeps ticking even when snapshot generation fails.
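The two layers of the fail-soft contract can be sketched as follows. This is an illustration under assumptions, not the module's code: the `_sketch` names are hypothetical, and collectors are modelled as zero-argument callables keyed by section name.

```python
import logging

logger = logging.getLogger("snapshot_sketch")

UNKNOWN = "unknown"


def collect_section(name, collector):
    """Run one per-source collector; degrade only this section on failure."""
    try:
        return collector()
    except Exception:
        logger.warning("snapshot section %r degraded", name, exc_info=True)
        return UNKNOWN


def compute_snapshot_sketch(collectors):
    """Never raises: each section is either its collector's value or "unknown"."""
    return {name: collect_section(name, fn) for name, fn in collectors.items()}


def run_cycle_sketch(collectors, write):
    """Periodic-task entry: log compute/write failures instead of propagating."""
    try:
        write(compute_snapshot_sketch(collectors))
        return True
    except Exception:
        logger.exception("snapshot cycle failed")
        return False
```

Because the outer wrapper swallows write errors too, a full disk or a bad path costs one missed snapshot and leaves the scheduler loop untouched.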
Atomic write
Uses `utils.atomic_replace` (same pattern `cron/jobs.py` uses for `jobs.json`). Tempfile written to the same parent dir, then renamed. No partial-write window.
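The tempfile-then-rename pattern described above can be sketched with the standard library alone. This is not the body of `utils.atomic_replace`, whose signature is not shown in this PR; `atomic_write_json` is a hypothetical stand-in that demonstrates the same idea.

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Write JSON so readers see either the old file or the new one, never a partial."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    # The temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=parent, prefix=".snapshot-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename over any existing file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Keeping the temp file beside its target matters because `os.replace` is only atomic within a single filesystem.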
CC#2 #137 fixture-isolation discipline applied
`/api/snapshot` endpoint tests patch `get_kora_home` in 3 namespaces (kora_constants + kora_cli.config + kora_cli.web_server) so parallel test workers can't see each other's snapshot files.
Test plan
Cascade
Standalone PR. Follow-on bucket dispatch suggestions:
🤖 Generated with Claude Code