Skip to content
This repository was archived by the owner on May 26, 2026. It is now read-only.

feat(kora): KR-CHEAP-PROMPT-CACHING — ephemeral cache on system prompt + tool descriptions#158

Merged
rafe-walker merged 1 commit into
feature/phase2-upgradesfrom
feat/kora-KR-CHEAP-PROMPT-CACHING
May 24, 2026
Merged

feat(kora): KR-CHEAP-PROMPT-CACHING — ephemeral cache on system prompt + tool descriptions#158
rafe-walker merged 1 commit into
feature/phase2-upgradesfrom
feat/kora-KR-CHEAP-PROMPT-CACHING

Conversation

@rafe-walker

Copy link
Copy Markdown
Owner

Summary

Prompt caching active. Expected ~50% input-token cost reduction on warm cache. Baseline-setting for subsequent cheap-substrate work (per Council R3 Lock R3-4 — this ships FIRST so post-caching telemetry measures the post-caching world).

Call sites with `cache_control`

Site What's cached How
`anthropic_engine.py:_tool_use_loop` — `kwargs["system"]` Entire `kora_system_prompt.md` content Bare string → `[{type: text, text: ..., cache_control: {type: ephemeral}}]` content-block list via new helper `_wrap_system_as_cacheable`
`anthropic_engine.py:_tool_use_loop` — `kwargs["tools"]` Full tool list (5 read-only reasoning tools) Last tool gets `cache_control: ephemeral` via new helper `_wrap_tools_as_cacheable`. Marker on last block covers everything before it per Anthropic API semantics. Input list never mutated.

Both structures are built once outside the iteration loop so byte-identical kwargs go to the SDK on every roundtrip — cache key matches → reads hit. Verified by `test_engine_sends_same_system_and_tools_each_iteration`.

Empty-tool-list path preserved: registry fails → `tools=[]` → kwargs["tools"] omitted entirely (some SDK versions reject empty arrays).

Cost-ladder accounting patch

The infrastructure ALREADY supported cache tokens (`CanonicalUsage(cache_read_tokens, cache_write_tokens)` + `PricingEntry.cache_*_cost_per_million` — Opus 4.7 priced at $0.50 read / $6.25 write per million). The gap was the engine→handler→holder data flow. Six-step patch:

  1. `ResponseResult` extended with `cache_creation_input_tokens` + `cache_read_input_tokens` (default 0, backwards-compatible).
  2. `_tool_use_loop` accumulates both per-iteration from `usage.cache_creation_input_tokens` / `usage.cache_read_input_tokens` (`getattr`-default 0 for older SDK or uncached call).
  3. `_project_final_response` signature + body pass cache totals through. Max-iter exit + projection-failed branches also wired.
  4. `slack_dm_handler.py` `reasoning_meta` dict gained the two cache fields at init + the engine-exception branch + the success-path meta build.
  5. `_record_inference_to_cost_ladder` reads cache fields from meta + builds `CanonicalUsage(input, output, cache_write_tokens, cache_read_tokens)`. Early-return guard now checks ALL four token buckets so a pure-cache-read call still bills.
  6. `_append_outbound_log_entry` extended with the same kwargs so cache tokens persist into the slack DM outbound JSONL — reasoning panel can surface per-call cache-hit rates without grepping logs.

K-DG verification

Anthropic SDK 0.87.0 (one minor ahead of spec's 0.86.0) verified to expose `CacheControlEphemeralParam`, `TextBlockParam`, `ToolParam` — cache_control syntax identical between minor versions per https://platform.claude.com/docs/en/agents-and-tools/prompt-caching. The cost-ladder code path was K-DG'd BEFORE edits: `grep -rn record_inference` surfaced the staticmethod at `slack_dm_handler.py:645` + the `PricingEntry` table in `agent/usage_pricing.py` with rates for every Claude 4.X model.

Tests

16 new tests in `tests/kora_cli/reasoning/test_anthropic_engine_caching.py` covering:

  • Wrapper unit tests (system shape, tool shape, mutation safety, empty + single-tool edge cases)
  • Engine integration (system + tools shape on SDK call; tools omitted when registry empty; SAME structures across iterations for cache key stability)
  • ResponseResult surfaces accumulated cache totals; tolerates missing usage attrs
  • Handler cost-ladder integration (cache fields → CanonicalUsage; None → 0; pure-cache-read still bills; all-zero skips)

1 pre-existing test updated in `test_anthropic_engine.py`: `test_system_prompt_passed_to_sdk` now asserts the content-block list shape.

Test plan

  • 16 new caching tests pass
  • 483/483 reasoning + handlers + listeners pass serially
  • Full repo xdist: 9228 passed; 43 failed identical to baseline (test_anthropic_adapter / test_backup / test_config / test_gateway_* / test_web_server* / test_kanban_db / test_list_picker_providers / test_model_switch_* / test_startup_plugin_gating). Zero failures in reasoning or handlers.
  • No regression in tool-use loop semantics (parallel dispatch, audit emit, max-iteration cap, error isolation)
  • No regression in SlackDM outbound JSONL backwards compat (cache fields use the same None-omit semantic as the existing reasoning fields)

🤖 Generated with Claude Code

…t + tool descriptions

Per Council R3 Lock R3-4, this is the build-list #1 baseline-
setter: prompt caching ships FIRST so subsequent cheap-substrate
telemetry measures the post-caching world. Expected ~50% input-
token cost reduction on warm reasoning calls.

# Two cache breakpoints (Anthropic API allows 4)

System prompt + tool descriptions both marked with
``cache_control: {"type": "ephemeral"}``. The marker on a block
caches everything UP TO AND INCLUDING that block, so two
breakpoints cover both static input regions.

# Call sites with cache_control

  1. ``kora_cli/reasoning/anthropic_engine.py:_tool_use_loop`` —
     SDK kwargs["system"] is now a content-block list (was bare
     string) with cache_control on its single text block. Built
     ONCE outside the iteration loop so the same structure is
     sent on every roundtrip → cache key matches → reads hit.
     Wrapped via new module-level helper ``_wrap_system_as_
     cacheable``.
  2. Same loop — kwargs["tools"] now has cache_control on the
     LAST tool only (covers everything before it per API
     semantics). Wrapped via ``_wrap_tools_as_cacheable``;
     input list never mutated.

Both wrappers tested as pure functions + via engine-level
integration assertions. Tools-empty case preserved: registry
fails → tools=[] → kwargs["tools"] omitted entirely (some SDK
versions reject empty arrays).

# Cost-ladder accounting patch

The cost-ladder infrastructure ALREADY supported cache tokens
via ``CanonicalUsage(cache_read_tokens, cache_write_tokens)`` +
``PricingEntry.cache_*_cost_per_million`` (Opus 4.7: $0.50
cache_read, $6.25 cache_write per million tokens). The gap was
the engine→handler→holder data flow:

  1. ``ResponseResult`` extended with
     ``cache_creation_input_tokens`` + ``cache_read_input_tokens``
     fields (default 0, backwards-compatible).
  2. ``_tool_use_loop`` accumulates both per iteration (mirror
     of input/output accumulation) reading from
     ``usage.cache_creation_input_tokens`` /
     ``usage.cache_read_input_tokens`` (getattr-default to 0
     when older SDK / uncached call leaves them absent).
  3. ``_project_final_response`` signature + body pass them
     through. Max-iter exit + the projection-failed branch also
     wired.
  4. ``slack_dm_handler.py``: initial ``reasoning_meta`` dict +
     the engine-exception branch + the success-path meta build
     all gained the two cache fields.
  5. ``_record_inference_to_cost_ladder`` reads the cache fields
     from meta + constructs ``CanonicalUsage(input_tokens,
     output_tokens, cache_write_tokens, cache_read_tokens)`` —
     PricingEntry's per-million rates do the rest. Early-return
     guard now checks ALL four token buckets (a pure-cache-read
     call still bills).
  6. ``_append_outbound_log_entry`` extended with the same two
     optional kwargs so the cache totals persist into the slack
     DM outbound JSONL — reasoning panel can surface cache-hit
     rate per call without grepping logs.

# K-DG verification

Anthropic SDK 0.87.0 (one minor ahead of the spec's 0.86.0)
verified to expose ``CacheControlEphemeralParam``,
``TextBlockParam``, ``ToolParam`` — cache_control syntax
identical between minor versions per the docs at
https://platform.claude.com/docs/en/agents-and-tools/prompt-caching.

# Tests

tests/kora_cli/reasoning/test_anthropic_engine_caching.py — 16
new tests:

  - Wrapper unit tests: system → content-block list with marker;
    tools → last marked, earlier unmarked; input not mutated;
    empty → empty; single-tool gets marker
  - Engine integration: system kwarg shape; tools[-1] has marker
    + tools[:-1] do NOT; tools kwarg omitted when registry
    returns empty; SAME system + tools sent on every iteration
    (cache key stability across tool-use loop)
  - ResponseResult surfaces cache_creation + cache_read totals;
    accumulates across iterations (write on iter 1, reads on
    iters 2+3); defaults to 0 when usage attrs missing
  - Handler cost-ladder: reasoning_meta cache fields flow into
    CanonicalUsage(cache_write_tokens, cache_read_tokens);
    None → 0 fallback; pure-cache-read call still bills; all-
    zero call still skips

tests/kora_cli/reasoning/test_anthropic_engine.py — 1 pre-
existing test updated: ``test_system_prompt_passed_to_sdk`` now
asserts the content-block list shape instead of bare-string.

# Regression

483/483 reasoning + handlers + listeners pass serially.
Full repo xdist: 9228 passed, 43 failed identical to baseline
(test_anthropic_adapter / test_backup / test_config /
test_gateway_* / test_web_server* / test_kanban_db /
test_list_picker_providers / test_model_switch_* /
test_startup_plugin_gating / test_web_server_cron_profiles).
Zero failures in reasoning or handlers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rafe-walker rafe-walker merged commit 559c46f into feature/phase2-upgrades May 24, 2026
@rafe-walker rafe-walker deleted the feat/kora-KR-CHEAP-PROMPT-CACHING branch May 24, 2026 01:28
rafe-walker added a commit that referenced this pull request May 24, 2026
Lock R3-8 sub-cut (c) implementation. 34 panels instrumented (8 panels + 26 pages).

Backend: POST /api/panel_view → ${KORA_HOME}/panel_views.jsonl (Path B chosen — separate file from kora_audit_log.jsonl to preserve audit log's forensic semantics per CC#2's K-DG sweep).

Hook: web/src/hooks/usePanelView.ts — fire-and-forget POST on mount; silent failure (instrumentation must never break UX).

18/18 endpoint+pin tests + 210/210 regression. tsc -b + vite build clean.

Rebased onto current feature/phase2-upgrades (post #157 snapshot + #158 caching) to resolve adjacent-endpoint-addition conflict in kora_cli/web_server.py.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant