fix(cache): improve Anthropic prompt cache hit rate with system split and tool stability #14743
bhagirathsinh-vaghela wants to merge 6 commits into anomalyco:dev
Conversation
The following comment was made by an LLM; it may be inaccurate: Potential related PRs found:
Note: PR #14203 appears to be the most directly related, as it's specifically about the system prompt splitting strategy that is a key component of PR #14743's improvements.
Thanks for updating your PR! It now meets our contributing guidelines. 👍
Reviewer's guide — supplementary context not covered in the PR description. Uses the same terminology (S1/S2, M1/M2) defined there.

AI SDK cache marker mechanics. Ref: Anthropic prompt caching docs | Anthropic engineers' caching best practices (Feb 19 2026): Thariq Shihipar, R. Lance Martin. Anthropic allows a maximum of 4 cache markers per request. Key subtlety: before this PR, OpenCode had a single system block. M1 covered it, but M2 was unused — it fell through to conversation. The system split (commit 3) is what activates both markers, letting S1 (stable) cache independently from S2 (dynamic). Since M1 covers the tool block too (tools hash before system in Anthropic's ordering), any tool instability (commits 4–5) completely invalidates M1 — the entire cached prefix up to that marker is lost.

Related open PRs. Several open PRs address parts of this (#5422, #14203, #10380, #11492). This PR addresses the root causes directly.
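For readers unfamiliar with the marker mechanics, here is a minimal sketch (illustrative names and placeholder content, not OpenCode's actual code) of what the split system prompt looks like in an Anthropic Messages API request body, with the `cache_control` field on each block playing the role of M1 and M2:

```typescript
// Sketch of a request body after the system split. S1 (stable) and S2
// (dynamic) are separate system blocks; the cache_control field on each
// is the cache marker (M1 / M2). Text content is a placeholder.
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

const s1: SystemBlock = {
  type: "text",
  text: "<provider prompt + global AGENTS.md>", // identical across repos/sessions
  cache_control: { type: "ephemeral" }, // M1: also covers the tools block hashed before it
};

const s2: SystemBlock = {
  type: "text",
  text: "<env block + project AGENTS.md>", // changes per repo
  cache_control: { type: "ephemeral" }, // M2: caches the dynamic tail separately
};

const body = {
  model: "<model-id>", // placeholder
  max_tokens: 1024,
  system: [s1, s2], // two independently cacheable blocks
  messages: [{ role: "user", content: "hello" }],
};

console.log(body.system.length);
```

With a single system block, only M1 is active and any change to that block invalidates the whole prefix; with the split, a new repo only re-writes S2.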
Force-pushed from b67a66a to 906a317
Force-pushed from 906a317 to c499424
CI failure seems pre-existing — same
Force-pushed from c499424 to 176c069
I pulled this into my fork and it's working beautifully. Unfortunately I only found this after getting a huge bill from Anthropic. Thanks OpenCode!
@bhagirathsinh-vaghela could you check this with SLMs like Qwen3 or Nemotron or Kimi-Linear or GPT-OSS? Or providers using the OpenAI-compatible APIs (e.g. OpenRouter)? Bonus ask: would Speculative Decoding work with this fork? I am looking at this from the lens of vLLM-MLX and MLX-OpenAI-Server (for non-MLX there is vLLM).
Force-pushed from 176c069 to f08aa45
The fixes are provider/model-agnostic — they stabilize the request prefix so it is byte-for-byte identical across calls. Any provider with server-side prefix caching benefits automatically. See my reviewer's guide comment above for the full breakdown of each fix. The specific model behind the provider does not matter — the changes are purely at the request layer. You can verify with any provider using `OPENCODE_CACHE_AUDIT=1`.
E2E failures — pre-existing upstream issue, since fixed. CI is green now.
Speculative decoding — orthogonal. This PR only changes what is sent in the request, not how the server processes it.
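One way to sanity-check the "byte-for-byte identical prefix" claim yourself, independent of any provider (a hedged sketch; capturing the serialized request bodies, e.g. from logs, is up to you):

```typescript
// Given two serialized request bodies from consecutive calls (e.g. the
// same session in two different repos), measure how long their shared
// byte prefix is. A server-side prefix cache can only reuse tokens up
// to roughly the first point of divergence.
function sharedPrefixLength(a: string, b: string): number {
  const max = Math.min(a.length, b.length);
  let i = 0;
  while (i < max && a[i] === b[i]) i++;
  return i;
}

// Toy payloads: only the dynamic suffix differs between repos.
const before = '{"tools":[...],"system":"stable DYNAMIC-A","messages":[]}';
const after = '{"tools":[...],"system":"stable DYNAMIC-B","messages":[]}';

// Everything up to the dynamic part is reusable by a prefix cache.
console.log(sharedPrefixLength(before, after));
```

The larger this value relative to the total payload, the more of the request a prefix-caching provider can reuse.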
…NCODE_CACHE_AUDIT
…OPENCODE_EXPERIMENTAL_CACHE_1H_TTL flag
Force-pushed from f08aa45 to 7984393
```diff
  ` Is directory a git repo: ${project.vcs === "git" ? "yes" : "no"}`,
  ` Platform: ${process.platform}`,
- ` Today's date: ${new Date().toDateString()}`,
+ ` Today's date: ${date.toDateString()}`,
```
Would it make sense to change the wording here, to hint to the LLM that this isn't a live-updating value? Otherwise it might make some weird choices elsewhere in long-lived conversations. E.g.
```diff
- ` Today's date: ${date.toDateString()}`,
+ ` Session started at: ${date.toDateString()}`,
```
Good point — this wording better conveys that the date is frozen. I'm keeping Today's date in this PR for now since it's what all OpenCode users expect (at least from experience, even if they are not consciously aware of it), but I'm not against the change if maintainers agree.
Separately, I've been experimenting locally with a progressive disclosure approach — making the env block fully static, instructing the model to fetch cwd, date, platform, etc. via tool calls when needed. Eliminates the block 2 cache write entirely at the cost of an occasional extra round-trip.
Interesting finding from this approach: completely removing the env block tended to result in models not bothering to fetch the info at all and assuming things instead, which is nondeterministic. A static block with explicit "figure it out when needed" instructions worked much better, at least with Anthropic models.
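A minimal sketch of the frozen-date behavior under discussion (illustrative, not OpenCode's actual code; `envBlock` is a hypothetical helper):

```typescript
// The "freeze once per process" idea behind the stabilization flag:
// capture the date a single time at module load, so every subsequent
// prompt build renders an identical string and the system block stays
// byte-stable across turns.
const sessionStart = new Date(); // captured once, not per request

function envBlock(cwd: string, platform: string): string {
  return [
    ` Working directory: ${cwd}`,
    ` Platform: ${platform}`,
    // Same frozen Date object every call -> same rendered line every turn.
    ` Today's date: ${sessionStart.toDateString()}`,
  ].join("\n");
}

const a = envBlock("/repo", "linux");
const b = envBlock("/repo", "linux"); // a later turn in the same process
console.log(a === b); // prints true: the block is byte-stable
```

With `new Date()` inlined instead, any conversation that crosses midnight would silently invalidate the cached prefix from this block onward.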
Separately, I've been experimenting locally with a progressive disclosure approach — making the env block fully static, instructing the model to fetch cwd, date, platform, etc. via tool calls when needed. [...] A static block with explicit "figure out when needed" instructions worked much better, at least with Anthropic models.
Hmm! I'll have to give that a shot when I patch from this PR later; I'm running locally against one of the Qwen3.5 models, so it'll be interesting data to see how they respond.
Looking forward to seeing less prompt re-processing with opencode. Unfortunately it seems this patchset currently breaks llama.cpp support:
Tested with and without the new autoparser. Maybe I'm using it wrong?
So, after partially reverting fix(cache): split system prompt into 2 blocks for independent caching (or rather, naively ensuring llama.cpp gets just one system prompt via revert.patch), opencode now flies with this patchset against a llama.cpp endpoint (OpenAI API, though). No more "erased invalidated context checkpoint" for every checkpoint, and no more reprocessing of the entire context seemingly whenever I send a new query. Checkpoint reuse usually sits at around 99%, sometimes dropping to 93%; the lowest was in the 70% range with > 60k tokens. Much appreciated! I wonder whether the split system message is something @pwilkin would be willing to support, or whether it should be guarded to only be sent to Anthropic endpoints.
Any chance the system message could be moved to the top of the messages list? We could possibly do this for the Anthropic API, but technically the system prompt should be the first message. |
Thanks @pwilkin. Given this is actually coming from the model template (Qwen 3.5) and not the parser, this is probably best handled on OpenCode's end.
Issue for this PR
Closes #5416, #5224
Related: #14065, #5422, #14203
Type of change
What does this PR do?
Fixes cross-repo and cross-session Anthropic prompt cache misses. Same-session caching already works (AI SDK places markers correctly). This PR fixes the cases where the prefix changes between repos, sessions, or process restarts — causing full cache writes on every first prompt.
Anthropic hashes tools → system → messages in prefix order. Any change to an earlier block invalidates everything after it. OpenCode has several sources of unnecessary prefix changes.
Terminology (1-indexed): S1/S2 = system block 1/2. M1/M2 = cache marker on S1/S2.
Always-active fixes:
- System prompt is a single block — dynamic content (env, project AGENTS.md) invalidates the stable provider prompt. Split into 2 blocks: stable (provider prompt + global AGENTS.md) first, dynamic (env + project) second.
- Bash tool schema includes `Instance.directory` — changes per-repo, invalidating the tool hash. Removed; the model gets cwd from the environment block.
- Skill tool ordering is nondeterministic — `Object.values()` on glob results. Sorted by name.

Opt-in fixes (behind env var flags):
- Date and instructions change between turns — `OPENCODE_EXPERIMENTAL_CACHE_STABILIZATION=1` freezes the date and caches instruction file reads for the process lifetime.
- Extended cache TTL — `OPENCODE_EXPERIMENTAL_CACHE_1H_TTL=1` sets a 1h TTL on M1 (2x write cost vs 1.25x for the default 5-min). Useful for sessions with idle gaps.

Commits: `OPENCODE_CACHE_AUDIT`, `OPENCODE_EXPERIMENTAL_CACHE_STABILIZATION`, `OPENCODE_EXPERIMENTAL_CACHE_1H_TTL`

What this doesn't fix:
Impact beyond Anthropic: The prefix stability fixes also benefit providers with automatic prefix caching (OpenAI, DeepSeek, Gemini, xAI, Groq) — no markers needed, just a stable prefix.
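The determinism fixes above (the skill-ordering one in particular) can be sketched as follows — illustrative types and names, not OpenCode's actual code:

```typescript
// Object.values() iteration over glob results follows insertion order,
// which depends on filesystem enumeration and can differ per process.
// Sorting by a stable key makes the serialized tools block deterministic,
// so the provider-side prefix hash is identical across runs and machines.
type Skill = { name: string; description: string };

function stableSkillList(byPath: Record<string, Skill>): Skill[] {
  return Object.values(byPath).sort((a, b) => a.name.localeCompare(b.name));
}

// Two processes discovering the same skills in different orders...
const runA = {
  "/skills/b.md": { name: "b", description: "beta" },
  "/skills/a.md": { name: "a", description: "alpha" },
};
const runB = {
  "/skills/a.md": { name: "a", description: "alpha" },
  "/skills/b.md": { name: "b", description: "beta" },
};

// ...now serialize identically.
console.log(
  JSON.stringify(stableSkillList(runA)) === JSON.stringify(stableSkillList(runB)),
); // prints true
```

Any stable total order works; the point is only that the serialized tool list never depends on enumeration order.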
How did you verify your code works?
`OPENCODE_CACHE_AUDIT=1` logs `[CACHE]` hit/miss per LLM call. Tested with Claude Sonnet 4.6 on the Anthropic direct API, `bun dev`, Feb 23 2026.

Cross-repo (different folder, within 5-min TTL — the key improvement):
BEFORE (no fixes):
AFTER (system split + tool stability):
The first prompt in a new repo goes from 0% → 97.6% cache hit. S1 (tools + provider prompt + global AGENTS.md) is reused across repos. These numbers are based on my setup — S1 is ~17,345 tokens, mostly tool definitions (~12k tokens), with provider prompt (~2k) and global AGENTS.md (~2.8k) making up the rest. Your numbers will differ based on your tool set (MCP servers, skills) and global AGENTS.md size, but the cross-repo miss is eliminated regardless.
Only block 2 (env with different cwd = 428 tokens) is a cache write on the first prompt in a new repo.
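The 97.6% figure falls out of the token counts above; a quick sketch of the arithmetic using the usage fields Anthropic reports per call (`cacheHitRate` is an illustrative helper, not OpenCode code):

```typescript
// Cache hit rate for one call: the fraction of input tokens served from
// cache, out of all input tokens (uncached + cache writes + cache reads).
function cacheHitRate(usage: {
  input_tokens: number;                // uncached input
  cache_creation_input_tokens: number; // written to cache this call
  cache_read_input_tokens: number;     // served from cache
}): number {
  const total =
    usage.input_tokens +
    usage.cache_creation_input_tokens +
    usage.cache_read_input_tokens;
  return total === 0 ? 0 : usage.cache_read_input_tokens / total;
}

// First prompt in a new repo with the split: S1 (~17,345 tokens) is read
// from cache, only the dynamic env block (~428 tokens) is written.
const rate = cacheHitRate({
  input_tokens: 0,
  cache_creation_input_tokens: 428,
  cache_read_input_tokens: 17345,
});
console.log((rate * 100).toFixed(1)); // prints 97.6
```

The exact percentage shifts with your S1 size (tool set, MCP servers, global AGENTS.md), but the structure of the calculation is the same.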
To reproduce:
Screenshots / recordings
N/A — no UI changes.
Checklist