Problem
User reports we get noticeably worse prefix / token cache hit rates than Claude Code on similar workloads. DeepSeek V4 has a real prefix cache (cache_hit_tokens / cache_miss_tokens are returned per turn — see cache_telemetry_supported), and the system prompt is already split into stable cache-friendly prefix blocks vs a volatile working-set tail. So either the split isn't actually stable, or something downstream is churning bytes that should be invariant.
This issue tracks an end-to-end audit of every place we send bytes to the API to identify what's accidentally non-deterministic across turns.
Background
- DeepSeek context caching reference (https://api-docs.deepseek.com/guides/kv_cache): DeepSeek's KV cache hits depend on a stable prefix at the byte level. A change at byte N invalidates the cache for everything from byte N onward.
- Prompt Cache paper (Gim et al. 2023, arXiv:2311.04934, Prompt Cache: Modular Attention Reuse for Low-Latency Inference) — formalizes the modular attention reuse that DeepSeek's cache implements in spirit.
- Anthropic's cache_control: explicit per-block cache markers; we already use this shape via SystemPrompt::Blocks. Worth comparing how Claude Code structures its blocks vs ours.
Suspected sources of cache churn (ranked by likelihood)
- Working-set summary block ordering: working_set::summary_block (crates/tui/src/working_set.rs) injects a "Repo Working Set" block at the END of the system prompt. Good. But it includes Workspace: <abs path> and a list of "open files" that may reorder turn-to-turn. Reorder = full cache miss for every byte after the working set block.
- Skills block ordering: render_available_skills_context (crates/tui/src/skills/mod.rs) sorts by name (deterministic, good) but the description text is read from disk at call time. Editing a SKILL.md between turns invalidates the skills block and everything that comes after it in the system prompt.
- Project context (crates/tui/src/project_context.rs): embedded into the stable prefix. Re-reads from disk; if any timestamp or hash sneaks in, churn.
- Mode prompt (crates/tui/src/prompts.rs): if mode changes (Plan ↔ Agent ↔ YOLO) the mode prompt changes. Expected. But within one mode it must be byte-identical.
- Tool catalog ordering: tool registration in engine/turn_loop.rs and the per-tool JSON schema inserted into the request. HashMap iteration order would be a disaster here. Confirm we use BTreeMap or sort by tool name.
- Tool-call IDs in assistant messages: when assistant messages with tool calls replay, the id field is per-call. If we generate a fresh UUID instead of reusing the original from the API, the entire message bytes diverge from cache.
- Reasoning content replay (V4 thinking mode): engine/turn_loop.rs inserts a placeholder when a round produces no reasoning. The placeholder text must be byte-identical across turns for the same shape.
- Date / timestamp injection: anywhere a chrono::Utc::now() or format!("{:?}", system_time) lands inside a cached block, it's a per-turn miss.
- Compaction summary (crates/tui/src/compaction.rs): once compaction fires, the summary becomes part of the stable prefix for subsequent turns. If we re-summarize on each turn instead of caching the summary, churn.
- Skills directory invalidation: every turn we discover skills via SkillRegistry::discover, which reads the skills directory. If two skills have the same description but different mtimes, the name sort keeps the order stable; but if we ever fall back to mtime or inode order, churn.
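Two of the suspects above (open-file ordering and tool catalog ordering) come down to serializing a collection in an order that depends only on its contents. A minimal sketch of that idea follows. The names render_working_set, render_tool_catalog, and ToolSpec are hypothetical stand-ins for illustration, not the repo's actual API, and the output format is made up.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical stand-in for a registered tool (not the repo's real type).
#[derive(Clone, Debug)]
pub struct ToolSpec {
    /// The tool's JSON schema, already serialized.
    pub schema_json: String,
}

/// Render a working-set block with the open files sorted and deduplicated,
/// so the bytes depend only on the set of files, not on the order in which
/// they were opened or touched.
pub fn render_working_set(workspace: &str, open_files: &[String]) -> String {
    let mut files: Vec<&str> = open_files.iter().map(String::as_str).collect();
    files.sort_unstable();
    files.dedup();

    let mut out = format!("Workspace: {}\n", workspace);
    for file in files {
        out.push_str("- ");
        out.push_str(file);
        out.push('\n');
    }
    out
}

/// Render the tool catalog sorted by tool name. Collecting into a BTreeMap
/// removes any dependence on HashMap iteration order, which is randomized
/// per process by default.
pub fn render_tool_catalog(tools: &HashMap<String, ToolSpec>) -> String {
    let sorted: BTreeMap<&str, &ToolSpec> =
        tools.iter().map(|(name, spec)| (name.as_str(), spec)).collect();

    let mut out = String::new();
    for (name, spec) in sorted {
        out.push_str(name);
        out.push_str(": ");
        out.push_str(&spec.schema_json);
        out.push('\n');
    }
    out
}
```

With this shape, calling render_working_set with the same files in a different order returns identical bytes, which is the property the cache needs. Sorting does not help when the contents themselves change (for example an edited SKILL.md); that case still invalidates everything after the changed byte.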
Investigation steps
- Telemetry dump: temporarily log cache_hit_tokens / cache_miss_tokens per turn (we already track them; just expose via a new debug command /cache that shows the last 10 turns' hit rates). This makes the audit measurable rather than hand-wavy.
- Byte-diff tool: write a tiny script that captures the JSON body sent to https://api.deepseek.com/v1/chat/completions for two consecutive turns, byte-diffs them, and prints the position of the first differing byte. The first divergence tells us where the cache breaks.
- Suspect-by-suspect bisection using the byte-diff tool against synthetic two-turn flows that hold each suspect constant.
- Compare with Claude Code's payload structure on the same workload (proxy through mitmproxy or similar) to see how their blocks are ordered and what their cache_control hints look like.
- Update working_set::summary_block and similar volatile-content surfaces to be deterministic: sort lists by name, drop timestamps, etc.
- Document the cache-stability invariants in crates/tui/src/prompts.rs so future contributors know what they must NOT do.
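The core of the byte-diff step is small enough to sketch here. This is an illustration only: first_divergence and context are hypothetical helper names, and capturing the two request bodies (for example by dumping them to files before sending) is assumed to happen elsewhere.

```rust
/// Return the index of the first byte at which `a` and `b` differ, or None
/// if the two bodies are byte-identical. If one body is a strict prefix of
/// the other, the divergence is reported at the length of the shorter one.
pub fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a
        .iter()
        .zip(b.iter())
        .take_while(|(x, y)| x == y)
        .count();
    if common == a.len() && common == b.len() {
        None
    } else {
        Some(common)
    }
}

/// Return a lossy UTF-8 window of up to `radius` bytes on each side of
/// `pos`, for printing what the request looked like around the divergence.
pub fn context(body: &[u8], pos: usize, radius: usize) -> String {
    let end = pos.saturating_add(radius).min(body.len());
    let start = pos.saturating_sub(radius).min(end);
    String::from_utf8_lossy(&body[start..end]).into_owned()
}
```

A typical use would be to read the two captured bodies with std::fs::read, call first_divergence, and print context for both bodies at the returned offset. Some divergence is expected on every turn, because the later request carries more messages. The interesting question is how early it occurs: an offset inside the system prompt or tool catalog points at one of the suspects, while an offset near where the earlier turn's message list ends is likely the healthy case.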
Acceptance criteria
- /cache command exists, shows hit/miss/ratio for the last N turns
- A regression test fixture in crates/tui/tests/ runs two synthetic turns over the same workspace and asserts the first N kilobytes of the request body are byte-identical
- Documented invariants in prompts.rs covering: deterministic ordering, no timestamps, no path normalization differences across calls
- Hit rate measurably improved on a long multi-turn session (concrete target: >70% hit rate after the third turn, against current baseline TBD)
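For the hit/miss/ratio display, one possible shape is a small ring buffer of per-turn counters. The sketch below is an assumption about how this could be structured: TurnCacheStats and CacheHistory are hypothetical types, the fields mirror the cache_hit_tokens / cache_miss_tokens counters mentioned above, and the output format is invented for illustration.

```rust
use std::collections::VecDeque;

/// Per-turn cache counters (hypothetical type for illustration).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct TurnCacheStats {
    pub hit_tokens: u64,
    pub miss_tokens: u64,
}

impl TurnCacheStats {
    /// Fraction of prompt tokens served from cache, or None when the turn
    /// reported no prompt tokens at all.
    pub fn hit_ratio(&self) -> Option<f64> {
        let total = self.hit_tokens + self.miss_tokens;
        if total == 0 {
            None
        } else {
            Some(self.hit_tokens as f64 / total as f64)
        }
    }
}

/// Keeps the most recent `capacity` turns, oldest first.
pub struct CacheHistory {
    capacity: usize,
    turns: VecDeque<TurnCacheStats>,
}

impl CacheHistory {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, turns: VecDeque::with_capacity(capacity) }
    }

    /// Record one turn, evicting the oldest entry once the window is full.
    pub fn record(&mut self, stats: TurnCacheStats) {
        if self.capacity == 0 {
            return;
        }
        if self.turns.len() == self.capacity {
            self.turns.pop_front();
        }
        self.turns.push_back(stats);
    }

    pub fn len(&self) -> usize {
        self.turns.len()
    }

    /// One line per retained turn, oldest first.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for (i, turn) in self.turns.iter().enumerate() {
            let ratio = match turn.hit_ratio() {
                Some(r) => format!("{:.1}%", r * 100.0),
                None => "n/a".to_string(),
            };
            out.push_str(&format!(
                "{:>2}  hit={} miss={} ratio={}\n",
                i + 1,
                turn.hit_tokens,
                turn.miss_tokens,
                ratio
            ));
        }
        out
    }
}
```

Returning None for a zero-token turn avoids printing a misleading 0% for a turn that never reached the API. The ratio is computed per turn rather than averaged over the window, so a single cold turn stays visible instead of being smoothed away.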
Out of scope
Notes for an investigating agent
The repo's existing telemetry and /context command give you most of the visibility you need. Start at step 1 (the /cache debug surface) — without measurement this is unfounded speculation. Step 2 (byte-diff) is the single highest-leverage diagnostic. Suspects are ranked above by my best guess but the byte-diff settles which one is actually breaking it.