DeepSeek V4 Flash cache hit rate dropped from ~98% to ~81% after v0.15.10 (ToolSearch) #4065
👋 @TradingLaboratory Thanks for this incredibly thorough analysis; the data is very compelling. Your root cause diagnosis is correct. ToolSearch changes how tool declarations appear in the system prompt: instead of being stably inlined in the prefix for every request, they are now loaded on demand, which breaks DeepSeek's prefix-based KV caching. There is currently no option to disable ToolSearch. We'll add a config option (e.g. […]). I'll track this in an issue. Thanks again for the detailed report and cost data; this is really helpful.
After updating Qwen Code CLI from v0.15.9 to v0.15.10, I noticed a dramatic drop in DeepSeek V4 Flash cached token ratio through OpenRouter. Here's the data:
Before v0.15.10 (9 days of data):
After v0.15.10:
Impact summary
Despite processing less than half the tokens compared to my daily average, I paid 3× more — from $1.05 to $3.30 per day. The uncached tokens skyrocketed from ~3M to 12.9M, a 4.3× increase. This is a significant real-world cost impact.
Root cause analysis
The likely culprit is ToolSearch (PR #3589) introduced in v0.15.10.
Here's the mechanism:
Up to v0.15.9: All MCP tool declarations were embedded directly in the system prompt at the start of every request. This created a stable, identical prefix across all requests within a session.
DeepSeek's caching model: DeepSeek uses prefix-based KV caching — it caches the beginning of the prompt on disk. If a subsequent request starts with the exact same prefix, the cached portion is reused at 10% of the original cost. This requires byte-identical prefix matching.
What ToolSearch changes: PR #3589 (feat(tools): add ToolSearch for on-demand loading of deferred tool schemas) defers tool loading. MCP tools are now loaded on demand via a ToolSearch call instead of being declared upfront. This means each request may have a different set of loaded tools in its prompt prefix, breaking the prefix stability.
The result: The prompt prefix changes between requests → DeepSeek's prefix-based cache misses → most tokens are billed at full uncached rate → cost spikes.
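To make the mechanism concrete, here is a minimal sketch of prefix matching. It is a simplified model, not Qwen Code's or DeepSeek's actual implementation: it compares characters rather than tokens or cache blocks, the tool names and prompt layout are hypothetical, and it assumes the tool block sits ahead of the conversation history in the prompt.

```python
import os

# Hypothetical prompt pieces, for illustration only.
SYSTEM = "You are a coding agent.\n"
ALL_TOOLS = ["read_file", "write_file", "run_shell", "search_docs"]


def build_prompt(tools, history):
    """Assumed layout: system text, then tool declarations, then the conversation."""
    tool_block = "".join(f"<tool>{name}</tool>\n" for name in tools)
    return SYSTEM + tool_block + "".join(history)


def common_prefix_len(a, b):
    """Length of the longest shared prefix of two prompts (character level)."""
    return len(os.path.commonprefix([a, b]))


def reusable_fraction(prev, curr):
    """Share of the current prompt that a prefix cache could reuse from the previous one."""
    return common_prefix_len(prev, curr) / len(curr)


history_1 = ["user: list files\n"]
history_2 = history_1 + ["assistant: done\n", "user: now edit main.py\n"]

# Stable declarations: every request carries the same tool block.
stable_prev = build_prompt(ALL_TOOLS, history_1)
stable_curr = build_prompt(ALL_TOOLS, history_2)

# Deferred loading: the second request has one more tool loaded than the first.
deferred_prev = build_prompt(["read_file"], history_1)
deferred_curr = build_prompt(["read_file", "write_file"], history_2)

print(f"stable:   {reusable_fraction(stable_prev, stable_curr):.0%} reusable")
print(f"deferred: {reusable_fraction(deferred_prev, deferred_curr):.0%} reusable")
```

In the stable case the whole previous prompt is a prefix of the next one, so only the new turns are uncached. In the deferred case the prompts diverge as soon as the tool block changes, so everything after that point, including the unchanged conversation history, no longer matches.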
This is a trade-off: ToolSearch saves ~15K tokens per request in prompt size, but at the cost of breaking prefix-based caching for models like DeepSeek that rely on stable prefixes. For heavy users of DeepSeek through OpenRouter, the savings from smaller prompts are dwarfed by the increased cost from cache misses.
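The trade-off can be put into rough numbers. The sketch below uses only figures quoted in this thread (cached tokens at 10% of the uncached price, hit rates of ~98% and ~81%, ~15K tokens saved per request) and assumes the hit rate applies uniformly to input tokens. The function names and the 100K-token prompt are hypothetical, and output tokens are ignored.

```python
CACHED_PRICE_RATIO = 0.10  # cached input billed at 10% of the uncached rate (figure from the post)


def relative_input_cost(prompt_tokens, hit_rate):
    """Input cost expressed in 'uncached-token equivalents'."""
    cached = prompt_tokens * hit_rate
    uncached = prompt_tokens - cached
    return uncached + cached * CACHED_PRICE_RATIO


def break_even_prompt_tokens(saved_tokens, hit_before, hit_after):
    """Original prompt size at which the smaller but less-cached prompt costs the same.

    Above this size, the lower hit rate outweighs the saved tokens.
    """
    factor_before = 1 - (1 - CACHED_PRICE_RATIO) * hit_before
    factor_after = 1 - (1 - CACHED_PRICE_RATIO) * hit_after
    if factor_after <= factor_before:
        raise ValueError("hit rate did not drop, so the smaller prompt is always cheaper")
    return saved_tokens * factor_after / (factor_after - factor_before)


# Hypothetical 100K-token prompt, before and after a 15K-token reduction.
before = relative_input_cost(100_000, 0.98)
after = relative_input_cost(100_000 - 15_000, 0.81)
print(f"before: {before:.0f}, after: {after:.0f} uncached-token equivalents")

threshold = break_even_prompt_tokens(15_000, 0.98, 0.81)
print(f"break-even original prompt size: {threshold:.0f} tokens")
```

Under these assumptions the break-even point is about 26.6K tokens: for any original prompt larger than that, the drop from 98% to 81% cache hits costs more than the 15K-token reduction saves.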
Has anyone else observed this? Is there a way to disable ToolSearch or keep tool declarations stable for models that benefit from prefix caching?