Audit prefix-cache hit rate vs Claude Code; identify where we churn the DeepSeek context cache #263

@Hmbown

Problem

User reports we get noticeably worse prefix / token cache hit rates than Claude Code on similar workloads. DeepSeek V4 has a real prefix cache (cache_hit_tokens / cache_miss_tokens are returned per turn — see cache_telemetry_supported), and the system prompt is already split into stable cache-friendly prefix blocks vs a volatile working-set tail. So either the split isn't actually stable, or something downstream is churning bytes that should be invariant.

This issue tracks an end-to-end audit of every place we send bytes to the API to identify what's accidentally non-deterministic across turns.

Background

  • DeepSeek context caching reference: https://api-docs.deepseek.com/guides/kv_cache. DeepSeek's KV cache hits depend on a stable prefix at the byte level: only the longest prefix shared with an earlier request can hit, so a change at byte N invalidates the cache for byte N and everything after it.
  • Prompt Cache paper (Gim et al. 2023, arXiv:2311.04934, Prompt Cache: Modular Attention Reuse for Low-Latency Inference) — formalizes the modular attention reuse that DeepSeek's cache implements in spirit.
  • Anthropic's cache_control — explicit per-block cache markers; we already use this shape via SystemPrompt::Blocks. Worth comparing how Claude Code structures its blocks vs ours.

Suspected sources of cache churn (ranked by likelihood)

  1. Working-set summary block ordering: working_set::summary_block (crates/tui/src/working_set.rs) injects a "Repo Working Set" block at the END of the system prompt. Good. But it includes Workspace: <abs path> and a list of "open files" that may reorder turn-to-turn. Reorder = full cache miss for every byte after the working set block.
  2. Skills block ordering: render_available_skills_context (crates/tui/src/skills/mod.rs) sorts by name (deterministic, good), but the description text is read from disk at call time. Editing a SKILL.md between turns invalidates the skills block and everything that comes after it in the system prompt.
  3. Project context (crates/tui/src/project_context.rs) — embedded into the stable prefix. Re-reads from disk; if any timestamp or hash sneaks in, churn.
  4. Mode prompt (crates/tui/src/prompts.rs) — if mode changes (Plan ↔ Agent ↔ YOLO) the mode prompt changes. Expected. But within one mode it must be byte-identical.
  5. Tool catalog ordering — tool registration in engine/turn_loop.rs and the per-tool JSON schema inserted into the request. HashMap iteration order would be a disaster here. Confirm we use BTreeMap or sort by tool name.
  6. Tool-call IDs in assistant messages — when assistant messages with tool calls replay, the id field is per-call. If we generate a fresh UUID instead of reusing the original from the API, the entire message bytes diverge from cache.
  7. Reasoning content replay (V4 thinking mode) — engine/turn_loop.rs inserts a placeholder when a round produces no reasoning. The placeholder text must be byte-identical across turns for the same shape.
  8. Date / timestamp injection — anywhere a chrono::Utc::now() or format!("{:?}", system_time) lands inside a cached block, it's a per-turn miss.
  9. Compaction summary (crates/tui/src/compaction.rs) — once compaction fires, the summary becomes part of the stable prefix for subsequent turns. If we re-summarize on each turn instead of caching the summary, churn.
  10. Skills directory discovery invalidation: every turn we discover skills via SkillRegistry::discover, which reads the skills directory. As long as the result is sorted by name, differing mtimes between skills do not matter. But if we ever fall back to mtime or inode order, churn.
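
Several of the suspects above (open-file list, tool catalog, skill discovery) come down to the same property: the same logical set must always render to the same bytes. A minimal sketch of that property follows. `ToolSpec`, `render_tool_catalog`, and `render_open_files` are hypothetical names used for illustration only; they are not taken from the repo, and the real types and output format will differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical tool description; the real type in the repo may differ.
pub struct ToolSpec {
    pub description: String,
}

/// Render a tool catalog in name order. A BTreeMap iterates in key order,
/// so the output does not depend on insertion order or on a hasher seed.
pub fn render_tool_catalog(tools: &BTreeMap<String, ToolSpec>) -> String {
    let mut out = String::new();
    for (name, spec) in tools {
        out.push_str(name);
        out.push_str(": ");
        out.push_str(&spec.description);
        out.push('\n');
    }
    out
}

/// Render the open-file list sorted and deduplicated, so the same set of
/// files always produces the same bytes regardless of the order in which
/// the files were opened.
pub fn render_open_files(paths: &[&str]) -> String {
    let mut sorted: Vec<&str> = paths.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted.join("\n")
}
```

If a `HashMap` is already in use, collecting its entries into a `Vec` and sorting by name before rendering should give the same guarantee without changing the storage type.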

Investigation steps

  1. Telemetry dump: temporarily log cache_hit_tokens / cache_miss_tokens per turn (we already track them; just expose via a new debug command /cache that shows the last 10 turns' hit rates). This makes the audit measurable rather than hand-wavy.
  2. Byte-diff tool: write a tiny script that captures the JSON body sent to https://api.deepseek.com/v1/chat/completions for two consecutive turns, byte-diffs them, and prints the position of the first differing byte. The first divergence tells us where the cache breaks.
  3. Suspect-by-suspect bisection using the byte-diff tool against synthetic two-turn flows that hold each suspect constant.
  4. Compare with Claude Code's payload structure on the same workload (proxy through mitmproxy or similar) to see how their blocks are ordered and what their cache_control hints look like.
  5. Update working_set::summary_block and similar volatile-content surfaces to be deterministic — sort lists by name, drop timestamps, etc.
  6. Document the cache-stability invariants in crates/tui/src/prompts.rs so future contributors know what they must NOT do.
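
The core of the byte-diff tool in step 2 can be sketched as a single function that returns the first differing offset between two captured request bodies. This is only an illustration: `first_divergence` and `context_window` are hypothetical helper names, and how the bodies are captured (logging hook, proxy, test fixture) is left open.

```rust
/// Return the index of the first byte at which `a` and `b` differ, or
/// `None` if they are identical. If one input is a strict prefix of the
/// other, the divergence is reported at the length of the shorter input.
pub fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count();
    if common == a.len() && common == b.len() {
        None
    } else {
        Some(common)
    }
}

/// Return up to `radius` bytes on either side of offset `at`, decoded as
/// lossy UTF-8, so the divergence can be read in context.
pub fn context_window(body: &[u8], at: usize, radius: usize) -> String {
    let end = at.saturating_add(radius).min(body.len());
    let start = at.saturating_sub(radius).min(end);
    String::from_utf8_lossy(&body[start..end]).into_owned()
}
```

With two bodies saved to disk, loading them via `std::fs::read` and printing `first_divergence` plus a `context_window` around it from each body should be enough to see which block changed.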

Acceptance criteria

  • /cache command exists, shows hit/miss/ratio for the last N turns
  • A regression test fixture in crates/tui/tests/ runs two synthetic turns over the same workspace and asserts the first N kilobytes of the request body are byte-identical
  • Documented invariants in prompts.rs covering: deterministic ordering, no timestamps, no path normalization differences across calls
  • Hit rate measurably improved on a long multi-turn session (concrete target: >70% hit rate after the third turn, against current baseline TBD)
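
One way the data behind a /cache command could be shaped is a small bounded history of per-turn counters with a derived hit ratio. The sketch below assumes the ratio is hit / (hit + miss); `TurnCacheStats` and `CacheHistory` are hypothetical names, and the existing telemetry types in the repo may already cover part of this.

```rust
use std::collections::VecDeque;

/// Per-turn cache counters as reported by the API (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TurnCacheStats {
    pub hit_tokens: u64,
    pub miss_tokens: u64,
}

impl TurnCacheStats {
    /// Fraction of prompt tokens served from cache, or `None` when the
    /// turn reported no prompt tokens at all.
    pub fn hit_ratio(&self) -> Option<f64> {
        let total = self.hit_tokens + self.miss_tokens;
        if total == 0 {
            None
        } else {
            Some(self.hit_tokens as f64 / total as f64)
        }
    }
}

/// Keeps only the most recent `capacity` turns.
pub struct CacheHistory {
    capacity: usize,
    turns: VecDeque<TurnCacheStats>,
}

impl CacheHistory {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, turns: VecDeque::with_capacity(capacity) }
    }

    /// Record a turn, evicting the oldest entry once the history is full.
    pub fn record(&mut self, stats: TurnCacheStats) {
        if self.capacity == 0 {
            return;
        }
        if self.turns.len() == self.capacity {
            self.turns.pop_front();
        }
        self.turns.push_back(stats);
    }

    /// One line per retained turn, oldest first. The row number is the
    /// position within the retained window, not the absolute turn number.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for (i, t) in self.turns.iter().enumerate() {
            let ratio = match t.hit_ratio() {
                Some(r) => format!("{:.1}%", r * 100.0),
                None => "n/a".to_string(),
            };
            out.push_str(&format!(
                "#{}: hit={} miss={} ratio={}\n",
                i + 1, t.hit_tokens, t.miss_tokens, ratio
            ));
        }
        out
    }
}
```

The same `hit_ratio` value could be compared against the 70% target in a long-session check once a baseline has been measured.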

Out of scope

Notes for an investigating agent

The repo's existing telemetry and /context command give you most of the visibility you need. Start at step 1 (the /cache debug surface) — without measurement this is unfounded speculation. Step 2 (byte-diff) is the single highest-leverage diagnostic. Suspects are ranked above by my best guess but the byte-diff settles which one is actually breaking it.

    Labels

    context (Context management / context), enhancement (New feature or request)
