feat(kora): KR-CHEAP-COST-TELEMETRY — per-route counters (R3-4 #10) by rafe-walker · Pull Request #161 · rafe-walker/kora

rafe-walker · 2026-05-24T01:48:06Z

Summary

Per Council R3 Lock R3-4 item #10. Eyes for the cost-economy discipline: every `record_inference` call gets a route tag; counters roll up per-route across 3 windows (process_lifetime, rolling_24h, monthly); cockpit + future tuning decisions read from this telemetry at $0 LLM cost.

Bucket spec: `17_cc_bucket_prompts/KR-CHEAP-COST-TELEMETRY_per_route_counters.md`

Routes wired vs deferred (per spec §4 STOP-ASK guidance)

| Route | Call site | Disposition |
|---|---|---|
| `slack_dm` | `slack_dm_handler.py:676` (Kora reply-bill) | ✅ wired this PR |
| `email_inbound` | (handler doesn't yet write a bill; engine path doesn't go through `holder.record_inference`) | ⏸️ literal reserved; wire when email handler gets a cost-ladder write |
| `email_outbound_compose` | no consumer yet | ⏸️ reserved |
| `mcp_tool` | no consumer yet | ⏸️ reserved |
| `alert_investigation` | no consumer (Lock R3-8 (d) work) | ⏸️ reserved |
| `probe_investigation` | no consumer (Lock R3-8 (b) work) | ⏸️ reserved |
| `tool_loop_iteration` | engine doesn't yet surface iteration tag | ⏸️ reserved |
| `scheduled_task` | no consumer yet | ⏸️ reserved |
| (agent main turn) | `conversation_loop.py:1603` | leaves default `unknown`: agent-side, not in Kora taxonomy |
| (auxiliary client side-tasks) | `auxiliary_client.py:5341/5360` | leaves default `unknown`: agent-side |

Every reserved route accepts the literal today; consumer wiring is a follow-on bucket. Per spec §4 "don't fail the bucket on missing call sites."

No `source=` harmonization needed

Spec §4 STOP-ASK #1: `record_inference` had no `source` kwarg today. The `IncomingMessage.source` field is at a different layer (engine input). Adding `route` is purely additive — zero existing-param conflict.
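
A minimal sketch of why the change is additive, assuming a simplified free-function stand-in for `record_inference`. The real method lives on `CostStateHolder` and its other parameters are not shown in this description, so `model` and `cost_usd` below are hypothetical placeholders:

```python
from typing import Any


def record_inference(
    model: str,
    cost_usd: float,
    *,
    route: str = "unknown",
    escalated_to_opus: bool = False,
) -> dict[str, Any]:
    """Hypothetical stand-in: returns the record that would be tagged."""
    return {
        "model": model,
        "cost_usd": cost_usd,
        "route": route,
        "escalated_to_opus": escalated_to_opus,
    }


# A pre-existing caller passes no route and keeps working.
legacy = record_inference("some-model", 0.01)

# A newly wired caller tags its route explicitly.
tagged = record_inference("some-model", 0.01, route="slack_dm")
```

Because both new parameters are keyword-only with defaults, no existing positional call can collide with them.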

Surface

| Layer | LOC |
|---|---|
| `kora_cli/telemetry/cost_telemetry.py` (NEW) | 320: `CostRouteTelemetry` + counters + windows + threading.RLock + singleton |
| `kora_cli/telemetry/__init__.py` (NEW) | 45: public re-exports |
| `kora_cli/listeners/cost_telemetry_listener.py` (NEW) | 270: persist task + window-reset tasks + path resolution + watch-and-act boundary checks |
| `kora_cli/listeners/__init__.py` | +6: wire-in last |
| `kora_cli/snapshot/state_snapshot.py` | SCHEMA_VERSION 1→2 + `_collect_cost_telemetry` + section in `compute_snapshot` |
| `kora_cli/web_server.py` | +28: `GET /api/cost_telemetry` endpoint |
| `kora_cli/handlers/slack_dm_handler.py` | +5: tag `route="slack_dm"` on Kora reply-bill |
| `agent/cost_state_holder.py` | +30: accept `route` + `escalated_to_opus` kwargs; telemetry hook |
| `agent/cost_ladder_wire.py` | +6: forward kwargs |
| Tests | 56 new (27 telemetry + 20 listener + 9 endpoint+snapshot-v2) |

Counter shape per route

```json
{
"calls_count": 0,
"input_tokens_total": 0,
"output_tokens_total": 0,
"cache_read_tokens_total": 0,
"cache_creation_tokens_total": 0,
"cost_estimate_usd_total": 0.0,
"escalation_count": 0,
"model_breakdown": {}
}
```

Stable shape from process boot — every known route pre-populated in every window so consumers don't branch on absence.
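
The pre-populated shape can be sketched as below. This is an illustration under assumptions, not the code in `cost_telemetry.py`: the helper names are hypothetical, locking is omitted for brevity, and `model_breakdown` is assumed to map a model name to a call count, which the description does not specify.

```python
KNOWN_ROUTES = (
    "slack_dm", "email_inbound", "email_outbound_compose", "mcp_tool",
    "alert_investigation", "probe_investigation", "tool_loop_iteration",
    "scheduled_task", "unknown",
)
WINDOWS = ("process_lifetime", "rolling_24h", "monthly")


def empty_counter() -> dict:
    """A fresh counter matching the per-route shape in the PR description."""
    return {
        "calls_count": 0,
        "input_tokens_total": 0,
        "output_tokens_total": 0,
        "cache_read_tokens_total": 0,
        "cache_creation_tokens_total": 0,
        "cost_estimate_usd_total": 0.0,
        "escalation_count": 0,
        "model_breakdown": {},
    }


def build_initial_state() -> dict:
    """Every known route exists in every window from boot."""
    return {w: {r: empty_counter() for r in KNOWN_ROUTES} for w in WINDOWS}


def record_call(state: dict, route: str, *, model: str, input_tokens: int,
                output_tokens: int, cost_usd: float,
                escalated: bool = False) -> None:
    # Unrecognized route strings bucket to "unknown" rather than raising.
    key = route if route in KNOWN_ROUTES else "unknown"
    for window in WINDOWS:
        counter = state[window][key]
        counter["calls_count"] += 1
        counter["input_tokens_total"] += input_tokens
        counter["output_tokens_total"] += output_tokens
        counter["cost_estimate_usd_total"] += cost_usd
        counter["escalation_count"] += int(escalated)
        # Assumed breakdown shape: model name -> number of calls.
        breakdown = counter["model_breakdown"]
        breakdown[model] = breakdown.get(model, 0) + 1


state = build_initial_state()
record_call(state, "not_a_real_route", model="some-model",
            input_tokens=10, output_tokens=5, cost_usd=0.25)
```

A consumer can then read `state["monthly"]["mcp_tool"]["calls_count"]` without checking whether the key exists, even though nothing has been recorded on that route.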

3 windows

| Window | Reset trigger |
|---|---|
| `process_lifetime` | process restart only |
| `rolling_24h` | UTC midnight (watch-and-act periodic task) |
| `monthly` | UTC month rollover (watch-and-act periodic task) |

`process_lifetime` excluded from on-disk snapshot file to bound file size; `/api/cost_telemetry` endpoint returns ALL THREE.
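
The watch-and-act reset checks, together with the first-tick stamping the commit message describes, might look roughly like this. `BoundaryWatcher` and the key functions are hypothetical names for illustration; the listener's actual structure is not shown in this PR description.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def utc_day_key(now: datetime) -> str:
    return now.astimezone(timezone.utc).strftime("%Y-%m-%d")


def utc_month_key(now: datetime) -> str:
    return now.astimezone(timezone.utc).strftime("%Y-%m")


class BoundaryWatcher:
    """Reports True when the period key changes between two checks."""

    def __init__(self, key_fn: Callable[[datetime], str]) -> None:
        self._key_fn = key_fn
        self._last_key: Optional[str] = None

    def check(self, now: datetime) -> bool:
        key = self._key_fn(now)
        if self._last_key is None:
            # First tick after boot: stamp the current period, do not reset.
            self._last_key = key
            return False
        if key != self._last_key:
            self._last_key = key
            return True
        return False


daily = BoundaryWatcher(utc_day_key)
first = daily.check(datetime(2026, 5, 23, 23, 30, tzinfo=timezone.utc))
crossed = daily.check(datetime(2026, 5, 24, 0, 30, tzinfo=timezone.utc))
same_day = daily.check(datetime(2026, 5, 24, 1, 30, tzinfo=timezone.utc))
```

Polling hourly and comparing keys means the reset can fire up to one poll interval after the boundary, which is the trade-off for not scheduling a task at exact midnight.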

Snapshot v2

`SCHEMA_VERSION` bumped 1 → 2. `compute_snapshot()` adds a `cost_telemetry` section exposing the two operator-facing windows (`rolling_24h` + `monthly`). End-to-end test (`test_api_snapshot_endpoint_includes_cost_telemetry`) confirms `/api/snapshot` returns the v2 shape.
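
A hedged sketch of how the v2 section could be assembled, including the degrade-to-empty-dicts behavior the commit message mentions. The function signatures and the `schema_version` key name are assumptions; only `cost_telemetry`, `rolling_24h`, and `monthly` come from the description.

```python
from typing import Callable

OPERATOR_WINDOWS = ("rolling_24h", "monthly")


def collect_cost_telemetry(get_snapshot: Callable[[], dict]) -> dict:
    """Expose only the operator-facing windows; fail soft to empty dicts."""
    try:
        full = get_snapshot()
        return {w: full.get(w, {}) for w in OPERATOR_WINDOWS}
    except Exception:
        return {w: {} for w in OPERATOR_WINDOWS}


def compute_snapshot(get_snapshot: Callable[[], dict]) -> dict:
    # "schema_version" is a hypothetical key name for SCHEMA_VERSION.
    return {
        "schema_version": 2,
        "cost_telemetry": collect_cost_telemetry(get_snapshot),
    }


def unavailable() -> dict:
    raise RuntimeError("telemetry not importable")


healthy = compute_snapshot(
    lambda: {"process_lifetime": {"a": 1}, "rolling_24h": {"b": 2}, "monthly": {}}
)
degraded = compute_snapshot(unavailable)
```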

Concurrency

`threading.RLock` on `CostRouteTelemetry` protects counter mutations + snapshot reads. Spec §4 STOP-ASK #3 says "propose lock strategy if non-trivial" — this one IS trivial (single coarse RLock around all writes/reads). Test `test_concurrent_record_call_no_lost_counts` proves 1000 calls across 10 threads sum exactly.
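
The shape of that concurrency check can be sketched as follows. This is not the test from the PR: `LockedCounter` and `hammer` are hypothetical stand-ins that show a single coarse `threading.RLock` guarding both writes and reads.

```python
import threading


class LockedCounter:
    """One coarse re-entrant lock around every write and read."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._calls = 0

    def record_call(self) -> None:
        with self._lock:
            self._calls += 1

    def snapshot(self) -> int:
        with self._lock:
            return self._calls


def hammer(counter: LockedCounter, threads: int = 10,
           calls_per_thread: int = 100) -> int:
    """Run record_call from several threads and return the final count."""

    def worker() -> None:
        for _ in range(calls_per_thread):
            counter.record_call()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.snapshot()


total = hammer(LockedCounter())  # 10 threads x 100 calls each
```

An `RLock` rather than a plain `Lock` lets a method that already holds the lock call another locked method on the same object without deadlocking.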

Read-only contract

This module is a READ-side observer of cost-ladder accounting. It does NOT mutate billing logic, the cost-ladder rung, or the $200/mo budget enforcement. Disabling telemetry (singleton swap or import failure) fails soft per-route at the `holder.record_inference` seam: billing proceeds and only the telemetry write is skipped.
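
A minimal sketch of the fail-soft seam, assuming a simplified ledger dict and an injected telemetry callable. Both are hypothetical; the real holder's internals are not shown here.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("cost_telemetry_sketch")


def record_inference(
    ledger: dict,
    cost_usd: float,
    *,
    route: str = "unknown",
    telemetry: Optional[Callable[[str, float], None]] = None,
) -> float:
    # Billing accumulation happens first and never depends on telemetry.
    ledger["total_usd"] = ledger.get("total_usd", 0.0) + cost_usd

    # Telemetry is an observer: any failure is logged and swallowed.
    if telemetry is not None:
        try:
            telemetry(route, cost_usd)
        except Exception:
            logger.debug("telemetry write failed; billing unaffected",
                         exc_info=True)
    return ledger["total_usd"]


def broken_hook(route: str, cost_usd: float) -> None:
    raise RuntimeError("telemetry backend unavailable")


ledger: dict = {}
record_inference(ledger, 0.25, route="slack_dm", telemetry=broken_hook)
record_inference(ledger, 0.5, route="slack_dm")  # telemetry disabled
```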

Test plan

56 new tests pass (27 telemetry + 20 listener + 9 endpoint/v2)
141/141 focused regression (telemetry + listener + snapshot + endpoint + cost_state_holder + cost_ladder_wire)
Ruff clean
Backwards-compat: existing `record_inference` callers without `route=` keep working (default `"unknown"`)

Cascade

Recommended follow-on bucket dispatch for filling the deferred routes:

KR-EMAIL-COST-BILL — wire `holder.record_inference` into `email_inbound_handler._send_auto_reply` symmetric to slack_dm; tag `route="email_inbound"` for the inbound reply path
KR-MCP-TOOL-COST-TAG — tag MCP-driven reasoning paths with `route="mcp_tool"`
KR-REASONING-ITERATION-TAG — engine surfaces iteration index; iteration 2+ tagged `route="tool_loop_iteration"`
KR-PLUGIN-AUDIT-COST-TAG — alert/probe investigation reasoning paths tagged once the wakeup machinery lands

🤖 Generated with Claude Code

Per Council R3 Lock R3-4 item #10. Data layer for all future tuning decisions: escalation-rate, classifier, route-shape. Without per-route counters we're flying blind on whether cheap-substrate work is saving what we expect.

# New module: kora_cli/telemetry/

* ``cost_telemetry.py``: ``CostRouteTelemetry`` singleton with threading.RLock-protected counters across 3 windows (process_lifetime / rolling_24h / monthly). Per-route counters: calls_count, input/output/cache_read/cache_creation_tokens_total, cost_estimate_usd_total, escalation_count, model_breakdown.
* ``__init__.py``: re-exports + canonical route + window literals.

Route taxonomy (v1, spec §2): slack_dm, email_inbound, email_outbound_compose, mcp_tool, alert_investigation, probe_investigation, tool_loop_iteration, scheduled_task, unknown.

Fail-soft: unknown route strings bucket to "unknown" rather than raising. ``record_call`` wraps the inner write in try/except so hot-path inference handlers never see a telemetry failure.

# Wire route through record_inference

``CostStateHolder.record_inference`` gains optional kwargs:

- ``route: str = "unknown"``: canonical taxonomy literal
- ``escalated_to_opus: bool = False``: Lock R3-3 tunable signal

Both are additive + backwards-compatible. After successful pricing estimation, telemetry's ``record_call`` is invoked alongside the existing billing accumulation. Telemetry write is READ-side (does NOT affect billing) and fail-soft on import / record errors.

``cost_ladder_wire.record_inference_from_response`` forwards both kwargs verbatim.

No source= harmonization needed (existing record_inference had no source kwarg; the IncomingMessage.source field is at a different layer, the engine input).

# Route wiring this PR (wired vs deferred)

| Route | Call site | Wired? |
|---|---|---|
| slack_dm | slack_dm_handler.py:676 (Kora reply-bill) | ✅ wired |
| email_inbound | (handler doesn't yet write a bill) | ⏸️ reserved |
| email_outbound_compose | (no consumer) | ⏸️ reserved |
| mcp_tool | (no consumer) | ⏸️ reserved |
| alert_investigation | (consumer pending KR-PLUGIN-AUDIT) | ⏸️ reserved |
| probe_investigation | (consumer pending probe-audit) | ⏸️ reserved |
| tool_loop_iteration | (engine doesn't yet surface iteration tag) | ⏸️ reserved |
| scheduled_task | (no consumer) | ⏸️ reserved |
| (agent main loop) | conversation_loop.py:1603 | leaves default "unknown": agent-side, not Kora-side; not in spec taxonomy |
| (auxiliary client) | auxiliary_client.py:5341/5360 | leaves default "unknown": agent-side |

Every reserved route accepts the literal today; consumer wiring is a follow-on bucket per the spec §4 STOP-ASK guidance ("don't fail the bucket on missing call sites").

# Periodic-task wiring (cost_telemetry_listener)

3 tasks registered with the heartbeat scheduler:

* ``cost_telemetry.persist``: 5min cadence; atomic-writes counter snapshot to ``${KORA_HOME}/cache/cost_telemetry.json``
* ``cost_telemetry.rolling_24h_reset``: 1h watch-and-act; fires reset on UTC date crossover
* ``cost_telemetry.monthly_reset``: 1h watch-and-act; fires reset on UTC month rollover

First-tick stamping pattern: the reset checks stamp "today's date" on first call after boot without firing a reset (counters at zero anyway), so resets only fire on subsequent boundary crossings. Avoids the trickiness of exact-midnight asyncio scheduling.

# Web endpoint

``GET /api/cost_telemetry`` returns the in-memory counter snapshot across all 3 windows (no disk roundtrip; no LLM cost).

# Snapshot v2 (KR-CHEAP-PRE-WARMED-SNAPSHOT extension)

Bumped ``SCHEMA_VERSION`` 1 → 2. ``compute_snapshot()`` adds a ``cost_telemetry`` section exposing the two operator-facing windows (``rolling_24h`` + ``monthly``); ``process_lifetime`` intentionally excluded from the on-disk snapshot to keep file size bounded. Operator hits ``/api/cost_telemetry`` for the full window set. Section degrades to empty dicts on telemetry unavailability (fail-soft).

# Concurrency

threading.RLock on CostRouteTelemetry protects counter mutations + snapshot reads. Same shape as cron/jobs.py + suitable for asyncio loop + cron + reasoning all potentially racing. Test ``test_concurrent_record_call_no_lost_counts`` proves 1000 calls across 10 threads → exact total (no lost increments).

# Read-only contract preserved

This module is a READ-side observer of cost-ladder accounting. It does NOT change billing accumulation logic. Telemetry can be disabled (singleton swap) without affecting the cost-ladder's $200/mo budget enforcement.

# Tests

141/141 pass:

* 27 cost_telemetry (counter shape + route taxonomy + windows + concurrency + singleton + JSON-serializability + fail-soft)
* 20 listener (registration + cadence resolution + persist cycle + window-reset checks + read-back + lifecycle log)
* 9 web endpoint + snapshot v2 (endpoint shape + version bump + cost_telemetry section + degradation + end-to-end)
* 51 snapshot tests pass (1 updated for schema v2 + new top-level key)
* 16 existing cost_state_holder + cost_ladder_wire tests pass (backwards-compat preserved across new optional kwargs)

Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
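
The atomic write in the persist task could be done along these lines. This is a sketch of one common technique (write to a temp file in the same directory, then `os.replace`), not a description of what `cost_telemetry_listener.py` actually does; `persist_snapshot` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def persist_snapshot(snapshot: dict, target: Path) -> None:
    """Write JSON so readers never observe a half-written file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must share a filesystem with the target for the
    # rename to be atomic, so create it in the same directory.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(snapshot, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, target)  # atomic swap into place
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as home:
    path = Path(home) / "cache" / "cost_telemetry.json"
    persist_snapshot({"rolling_24h": {}, "monthly": {}}, path)
    loaded = json.loads(path.read_text(encoding="utf-8"))
    leftovers = [p.name for p in path.parent.iterdir() if p.name != path.name]
```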

…or (#166)

Closes the unified-operator-interface loop. Tails audit JSONL for probe.wake_requested events (PR #163 emits); per (probe, issue_category) inline debounce; invokes engine.respond() with structured probe context (issue + recent observations + envelope status); DMs operator via existing client.post_dm path.

Activates route='probe_investigation' telemetry literal (PR #161 reserved). Engine reads message.source to derive route through existing record_inference site, so no telemetry-side changes are needed.

Env vars added:

- KORA_PROBE_DEBOUNCE_SECONDS=600 (10 min default; 0 disables)
- KORA_PROBE_DEBOUNCE_BYPASS_CRITICAL=false (fail-closed; opt-in even for critical)
- KORA_PROBE_WAKE_POLL_SEC=30 (listener tail cadence)

KORA_SLACK_JOSHUA_USER_ID reused from PR #149.

All 4 STOP-ASK conditions resolved inline:

- MessageSource Literal extended (1-line) with 'probe_investigation' + _derive_caller_session_id returns 'probe:{probe}:{category}' for future panel xref
- Listener-coordinator wire uniform across 9 listeners (register_daemon_listener pattern)
- Operator channel canonicalized at KORA_SLACK_JOSHUA_USER_ID (PR #149 precedent)
- Tail-position stamping at first-tick (don't replay history at boot): inverse of AlertNotifier's set-diff semantic; documented

Wake-to-DM latency ~30s worst case (poll cadence), tunable to 5s. 42 new tests + 634/634 cross-bucket regression + ruff clean.
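
The per (probe, issue_category) debounce can be illustrated with a small sketch. `WakeDebouncer` is a hypothetical name and the critical-bypass flag is left out; the only details taken from the commit message are the key shape, the 600-second default, and that 0 disables debouncing.

```python
from typing import Dict, Tuple


class WakeDebouncer:
    """Suppress repeat wakes for the same (probe, category) inside a window."""

    def __init__(self, window_seconds: float = 600.0) -> None:
        self._window = window_seconds
        self._last_fired: Dict[Tuple[str, str], float] = {}

    def should_fire(self, probe: str, category: str, now: float) -> bool:
        if self._window <= 0:
            return True  # a window of 0 disables debouncing
        key = (probe, category)
        last = self._last_fired.get(key)
        if last is not None and now - last < self._window:
            return False
        self._last_fired[key] = now
        return True


debouncer = WakeDebouncer(600.0)
first = debouncer.should_fire("disk", "full", now=0.0)
repeat = debouncer.should_fire("disk", "full", now=300.0)
other_category = debouncer.should_fire("disk", "inode", now=300.0)
after_window = debouncer.should_fire("disk", "full", now=600.0)
```

Passing `now` explicitly keeps the sketch deterministic; a real listener would more likely read a monotonic clock.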

rafe-walker merged commit 93f0548 into feature/phase2-upgrades May 24, 2026

rafe-walker deleted the feat/kora-KR-CHEAP-COST-TELEMETRY branch May 24, 2026 01:55

rafe-walker mentioned this pull request May 24, 2026

research(kora): KR-FORK-HOOK-VERIFY — Hermes Gateway hook coverage report #168

Merged

6 tasks
