Audit prefix-cache hit rate vs Claude Code; identify where we churn the DeepSeek context cache #263

## Problem

User reports we get noticeably worse prefix / token cache hit rates than Claude Code on similar workloads. DeepSeek V4 has a real prefix cache (cache_hit_tokens / cache_miss_tokens are returned per turn — see `cache_telemetry_supported`), and the system prompt is already split into stable cache-friendly prefix blocks vs a volatile working-set tail. So either the split isn't actually stable, or something downstream is churning bytes that should be invariant.

This issue tracks an end-to-end audit of every place we send bytes to the API to identify what's accidentally non-deterministic across turns.

## Background

- **DeepSeek context caching reference** (https://api-docs.deepseek.com/guides/kv_cache): DeepSeek's KV cache hits depend on a stable prefix at the byte level. A change at byte N invalidates the cache from byte N onward; only the unchanged prefix before it can still hit.
- **Prompt Cache paper** (Gim et al. 2023, arXiv:2311.04934, *Prompt Cache: Modular Attention Reuse for Low-Latency Inference*): formalizes the modular attention reuse that DeepSeek's cache implements in spirit.
- **Anthropic's `cache_control`**: explicit per-block cache markers; we already use this shape via `SystemPrompt::Blocks`. Worth comparing how Claude Code structures its blocks vs ours.

## Suspected sources of cache churn (ranked by likelihood)

1. **Working-set summary block ordering**: `working_set::summary_block` (`crates/tui/src/working_set.rs`) injects a "Repo Working Set" block at the END of the system prompt. Good. But it includes `Workspace: <abs path>` and a list of "open files" that may reorder turn-to-turn. **A reorder is a cache miss from the first reordered byte onward, including every byte after the working set block.**
2. **Skills block ordering**: `render_available_skills_context` (`crates/tui/src/skills/mod.rs`) sorts by name (deterministic, good), but the description text is read from disk at call time. Editing a SKILL.md between turns invalidates the skills block and everything that comes after it in the system prompt.
3. **Project context** (`crates/tui/src/project_context.rs`): embedded into the stable prefix. Re-reads from disk; if any timestamp or hash sneaks in, churn.
4. **Mode prompt** (`crates/tui/src/prompts.rs`): if the mode changes (Plan ↔ Agent ↔ YOLO) the mode prompt changes. Expected. But within one mode it must be byte-identical.
5. **Tool catalog ordering**: tool registration in `engine/turn_loop.rs` and the per-tool JSON schema inserted into the request. `HashMap` iteration order would be a disaster here. Confirm we use `BTreeMap` or sort by tool name.
6. **Tool-call IDs in assistant messages**: when assistant messages with tool calls are replayed, the `id` field is per-call. If we generate a fresh UUID instead of reusing the original from the API, the replayed message's bytes diverge from the cached prefix.
7. **Reasoning content replay** (V4 thinking mode): `engine/turn_loop.rs` inserts a placeholder when a round produces no reasoning. The placeholder text must be byte-identical across turns for the same shape.
8. **Date / timestamp injection**: anywhere a `chrono::Utc::now()` or `format!("{:?}", system_time)` lands inside a cached block, it's a per-turn miss.
9. **Compaction summary** (`crates/tui/src/compaction.rs`): once compaction fires, the summary becomes part of the stable prefix for subsequent turns. If we re-summarize on each turn instead of caching the summary, churn.
10. **Skills-in-skills directory invalidation**: every turn we discover skills via `SkillRegistry::discover`, which reads the directory. Sorting by name keeps the order stable even when two skills have the same description but different mtimes; if we ever fall back to mtime/inode ordering, churn.
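
Suspects 1, 5, and 10 come down to the same rule: anything rendered into a cached block should be emitted in an order derived from its content, never from insertion order, hash order, or filesystem metadata. The following is a minimal sketch of that rule using only the standard library. The helper names `render_tool_catalog` and `render_open_files` are hypothetical and are not taken from the repo; the real rendering code in `engine/turn_loop.rs` and `working_set.rs` may be shaped quite differently.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

/// Hypothetical helper: render tool schemas in name order.
/// Copying the entries into a BTreeMap makes the output independent of
/// HashMap's randomized iteration order.
fn render_tool_catalog(tools: &HashMap<String, String>) -> String {
    let sorted: BTreeMap<&str, &str> = tools
        .iter()
        .map(|(name, schema)| (name.as_str(), schema.as_str()))
        .collect();

    let mut out = String::new();
    for (name, schema) in sorted {
        out.push_str(name);
        out.push_str(": ");
        out.push_str(schema);
        out.push('\n');
    }
    out
}

/// Hypothetical helper: render the open-file list sorted and de-duplicated,
/// so the order in which files were opened does not leak into the prompt.
fn render_open_files(paths: &[&str]) -> String {
    let unique: BTreeSet<&str> = paths.iter().copied().collect();
    unique.into_iter().collect::<Vec<_>>().join("\n")
}
```

With this shape, two turns that hold the same set of tools and open files produce the same bytes regardless of the order in which they were registered or opened. It does not help with suspect 2 (content edited on disk), which is a genuine change rather than an ordering artifact.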

## Investigation steps

1. **Telemetry dump**: temporarily log `cache_hit_tokens` / `cache_miss_tokens` per turn (we already track them; just expose via a new debug command `/cache` that shows the last 10 turns' hit rates). This makes the audit measurable rather than hand-wavy.
2. **Byte-diff tool**: write a tiny script that captures the JSON body sent to `https://api.deepseek.com/v1/chat/completions` for two consecutive turns, byte-diffs them, and prints the position of the first differing byte. The first divergence tells us where the cache breaks.
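3. **Suspect-by-suspect bisection** using the byte-diff tool against synthetic two-turn flows that pin one suspect at a time. If the first divergence moves or disappears when a suspect is pinned, that suspect was contributing to the churn.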
4. Compare with Claude Code's payload structure on the same workload (proxy through mitmproxy or similar) to see how their blocks are ordered and what their cache_control hints look like.
5. Update `working_set::summary_block` and similar volatile-content surfaces to be deterministic — sort lists by name, drop timestamps, etc.
6. Document the cache-stability invariants in `crates/tui/src/prompts.rs` so future contributors know what they must NOT do.
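
The core of the byte-diff tool in step 2 is small. Here is a sketch, assuming the two request bodies have already been captured as raw bytes (for example by logging the serialized JSON just before it is sent). The function names `first_divergence` and `context_window` are illustrative and do not exist in the repo.

```rust
/// Returns the index of the first byte at which `a` and `b` differ,
/// or `None` if the two bodies are identical.
/// If one body is a strict prefix of the other, the divergence is
/// reported at the length of the shorter body.
fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a
        .iter()
        .zip(b.iter())
        .take_while(|(x, y)| x == y)
        .count();
    if common == a.len() && common == b.len() {
        None
    } else {
        Some(common)
    }
}

/// Returns up to `radius` bytes on either side of `offset`, decoded lossily,
/// so the diverging JSON field is recognizable in the output.
fn context_window(body: &[u8], offset: usize, radius: usize) -> String {
    let start = offset.saturating_sub(radius).min(body.len());
    let end = offset.saturating_add(radius).min(body.len());
    String::from_utf8_lossy(&body[start..end]).into_owned()
}
```

Calling `first_divergence` on the bodies of turn N and turn N+1 gives the byte offset where the cacheable prefix ends; printing `context_window` for both bodies at that offset usually shows which block or field changed. In a healthy session the first divergence should fall where the new turn's messages begin, not inside the system prompt or tool catalog.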

## Acceptance criteria

- `/cache` command exists, shows hit/miss/ratio for the last N turns
- A regression test fixture in `crates/tui/tests/` runs two synthetic turns over the same workspace and asserts the first N kilobytes of the request body are byte-identical
- Documented invariants in `prompts.rs` covering: deterministic ordering, no timestamps, no path normalization differences across calls
- Hit rate measurably improved on a long multi-turn session (concrete target: >70% hit rate after the third turn, against current baseline TBD)
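
To make the first and last criteria concrete, here is a rough sketch of the state a `/cache` command could keep. The field names mirror the `cache_hit_tokens` / `cache_miss_tokens` counters mentioned above; the `TurnCacheStats` and `CacheHistory` types, and the output format, are assumptions made for illustration and not existing code.

```rust
use std::collections::VecDeque;

/// Per-turn cache counters as reported by the API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct TurnCacheStats {
    cache_hit_tokens: u64,
    cache_miss_tokens: u64,
}

impl TurnCacheStats {
    /// Hit ratio in [0.0, 1.0]; `None` when the turn reported no prompt tokens.
    fn hit_ratio(&self) -> Option<f64> {
        let total = self.cache_hit_tokens + self.cache_miss_tokens;
        if total == 0 {
            None
        } else {
            Some(self.cache_hit_tokens as f64 / total as f64)
        }
    }
}

/// Hypothetical bounded history: keeps only the most recent `capacity` turns.
struct CacheHistory {
    capacity: usize,
    turns: VecDeque<TurnCacheStats>,
}

impl CacheHistory {
    fn new(capacity: usize) -> Self {
        Self { capacity, turns: VecDeque::with_capacity(capacity) }
    }

    /// Records a turn, evicting the oldest entry once the history is full.
    fn record(&mut self, stats: TurnCacheStats) {
        if self.capacity == 0 {
            return;
        }
        if self.turns.len() == self.capacity {
            self.turns.pop_front();
        }
        self.turns.push_back(stats);
    }

    /// One line per retained turn, oldest first.
    fn render(&self) -> String {
        self.turns
            .iter()
            .map(|t| match t.hit_ratio() {
                Some(r) => format!(
                    "hit={} miss={} ratio={:.1}%",
                    t.cache_hit_tokens,
                    t.cache_miss_tokens,
                    r * 100.0
                ),
                None => "hit=0 miss=0 ratio=n/a".to_string(),
            })
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

Under this sketch, a turn with 750 hit tokens and 250 miss tokens reports a ratio of 75%, which would clear the >70% target. Returning `None` for an empty turn avoids a division by zero and keeps such turns from being counted as 0% hits.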

## Out of scope

- Changing DeepSeek's actual cache semantics (we don't control the server)
- Implementing our own client-side cache (DeepSeek's is the source of truth)
- Tokenizer accuracy (#232 closure — kept the chars/3 heuristic deliberately)

## Notes for an investigating agent

The repo's existing telemetry and `/context` command give you most of the visibility you need. Start at step 1 (the `/cache` debug surface) — without measurement this is unfounded speculation. Step 2 (byte-diff) is the single highest-leverage diagnostic. Suspects are ranked above by my best guess but the byte-diff settles which one is actually breaking it.
