fix(core): stabilize DeepSeek tool cache prefix#4518
Conversation
wenshao
left a comment
There was a problem hiding this comment.
LGTM — clean, well-motivated refactor. The getCachedPromptTokens extraction is a good de-duplication, and the hostname-gated tool sort is correctly scoped.
Minor suggestions for follow-up (low confidence, not blocking):
- Consider adding a test that verifies the
getCachedPromptTokensprecedence chain (e.g.,prompt_tokens_details.cached_tokenswins overprompt_cache_hit_tokenswhen both are present). - Consider adding a streaming-path test for
convertOpenAIChunkToGeminiwith DeepSeek cache fields. prompt_cache_miss_tokensis declared inExtendedCompletionUsagebut never read — consider removing or documenting intent.
— qwen3.7-max via Qwen Code /review
pomelo-nwu
left a comment
There was a problem hiding this comment.
Hi @Jerry2003826, thank you for your continued contributions — 9 PRs in a short time is impressive! 🎉
As we review your changes, we'd like to ask you to update each PR to follow the latest PR template on the main branch. The most important section is the Reviewer Test Plan, which significantly accelerates the review and merge process.
Specifically, for each PR please include:
- How to verify — clear reproduction steps so a reviewer can confirm the fix/feature
- Evidence (Before & After) — use the
tmux-real-user-testingskill (or manual tmux capture) to show before/after screenshots of the TUI behavior. Side-by-side evidence makes it much faster for maintainers to validate and merge - Tested on — fill in the OS table (macOS / Windows / Linux)
PRs with a complete Reviewer Test Plan are prioritized for review — without it, review may be delayed.
You can see the full template at: .github/pull_request_template.md
Thanks again for your effort — looking forward to getting these merged! 🚀
中文说明
你好 @Jerry2003826,感谢你的持续贡献——短时间内提交了 9 个 PR,非常高效!🎉
在 review 过程中,我们希望你能按照 main 分支上最新的 PR 模版更新每个 PR 的描述。其中最关键的部分是 Reviewer Test Plan,它能显著加速审核和合并流程。
具体来说,请为每个 PR 补充:
- How to verify — 清晰的复现步骤,让 reviewer 能确认修复/功能的效果
- Evidence (Before & After) — 使用
tmux-real-user-testingskill(或手动 tmux 截取)展示修改前后的 TUI 截图对比,前后对比能让维护者更快地验证和合并 - Tested on — 填写操作系统测试表格(macOS / Windows / Linux)
有完整 Reviewer Test Plan 的 PR 会被优先审核——缺少该部分可能会导致审核延迟。
完整模版见:.github/pull_request_template.md
再次感谢你的付出,期待尽快把这些 PR 合并!🚀
— Qwen Code
|
Updated the PR description to the latest template, including Reviewer Test Plan, Evidence, Tested on, Risk & Scope, and the Chinese summary. No code changes were needed for this update; the existing validation remains: npm test --workspace=packages/core -- src/core/openaiContentGenerator/pipeline.test.ts
npm test --workspace=packages/core -- src/core/openaiContentGenerator/converter.test.ts
npm run typecheck --workspace=packages/core
npm run lint --workspace=packages/core
git diff --check |
本地验证报告 — PR #4518作为 maintainer 在本地用 tmux 跑了完整真实验证。结论:通过,建议合并。 验证环境
1. 单元测试
间接 + 下游测试是通过 2. 静态检查3. 真实端到端复现(用 tsx 直接 import 源码)独立 repro 脚本 用 4. 代码层面的观察正向:
潜在风险(不阻塞,仅记录):
5. 没在本地覆盖
结论
建议合并。 没有阻塞问题。 — wenshao |
Live DeepSeek API 验证 — PR #4518 补充之前的 验证报告 跑的是单元测试 + tsx 端到端,但没击中真实的 DeepSeek API。这次补充用 1. 验证方式写了一个独立脚本
所有请求
2. 真实 API 跑出的结果3. 命中率对照
PR 描述里写的 "from about 99.5% to 0%" 在 deepseek-v4-pro 上 1:1 复现(实测 99.86% → 0% → 99.86%)。 4. 进一步探查:DeepSeek 缓存的真实行为为了排除「是不是只是 prompt 前缀某段在缓存」,又写了 结论:DeepSeek 的 prompt 缓存对 tools 数组顺序非常敏感,哪怕只是反转或洗牌(tool 集 / schema 完全相同),缓存就 100% 击穿。回到原始顺序,缓存立即恢复。这正是 PR 的 sort 修复要解决的问题。 5. 价值估算按这次实测:每次「tool 顺序漂移」会让一次请求多消耗 ~8320 prompt tokens(按 deepseek-v4-pro 当前定价 cache miss 约 $0.14/M tokens vs cache hit 约 $0.014/M tokens 算,单次约多花 0.1 美分 + 数百毫秒 prefill 延迟)。对一个跑长 session、多次重新组装 tool 列表的 agent workflow(比如 MCP 动态加载工具),日积月累不容忽视。 更重要的是延迟:cache hit 走 KV cache 是几乎瞬时的 prefill,cache miss 要把 8000+ token 完整重算 —— 对响应延迟敏感的交互场景影响很大。 6. 修订上一份报告的结论之前的报告里我写「单测 + tsx 端到端 26/26 pass」,那确实验证了代码逻辑没问题,但没击中 PR 想堵的真实缓存行为。这次真实 API 验证补齐了这一块:
结论:强烈建议合并。 这不只是一个理论 fix,而是一个有真实缓存命中率影响的 bug —— 对所有用 Repro 复现路径: cd /private/tmp/pr4518-test
export DEEPSEEK_API_KEY=<your-key> # 或者读 ~/.qwen/settings.json env.DEEPSEEK_API_KEY_GMAIL
npx tsx scripts/deepseek-live-repro.mts— wenshao |
|
Cleaned up the PR description to match the latest template and incorporated the live DeepSeek API validation results from the maintainer report. Code is unchanged; current CI is green across macOS, Windows, and Linux. |
|
This PR shares modified code with PR #4505 (fix(core): emit enable_thinking on DashScope when reasoning is disabled). Shared functions:
Both PRs also modify files in the
Recommendation: These PRs both modify the OpenAI content generation pipeline. Coordinate with PR #4505 author before merging. This conflict was detected automatically by CodeScope skills — structural analysis of call graphs and file modifications. |
|
This PR shares modified code with PR #4505 (fix(core): emit enable_thinking on DashScope when reasoning is disabled). Shared functions:
Both PRs also modify files in the
Recommendation: These PRs both modify the OpenAI content generation pipeline. Coordinate with PR #4505 author before merging. This conflict was detected automatically by CodeScope skills — structural analysis of call graphs and file modifications. |
| interface ExtendedCompletionUsage extends OpenAI.CompletionUsage { | ||
| cached_tokens?: number; | ||
| prompt_cache_hit_tokens?: number; | ||
| prompt_cache_miss_tokens?: number; |
There was a problem hiding this comment.
[Suggestion] prompt_cache_miss_tokens is declared on ExtendedCompletionUsage but never read anywhere. getCachedPromptTokens only consumes prompt_cache_hit_tokens. The test fixture sets prompt_cache_miss_tokens: 105 but asserts nothing about it.
Dead interface surface — signals to future readers that this field is consumed when it is not. If a consumer is needed later (e.g., for cache-hit-rate observability), it should be added alongside the reading code.
| prompt_cache_miss_tokens?: number; |
(Remove this line and the corresponding prompt_cache_miss_tokens: 105 from the test fixture in converter.test.ts.)
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in cf0a82c. Removed the unused prompt_cache_miss_tokens extension field and the fixture value, since this PR only consumes cache-hit counts.
| expect(response.candidates).toEqual([]); | ||
| }); | ||
|
|
||
| it('maps DeepSeek prompt cache hit tokens into cached content token count', () => { |
There was a problem hiding this comment.
[Suggestion] This test covers the non-streaming path (convertOpenAIResponseToGemini), but the streaming path (convertOpenAIChunkToGemini) also calls getCachedPromptTokens and has no corresponding test for prompt_cache_hit_tokens.
DeepSeek's streaming API may include prompt_cache_hit_tokens in the final chunk's usage object. The streaming usage assembly has its own estimation fallback logic (totalTokens > 0 && promptTokens === 0 && completionTokens === 0) that interacts with cachedTokens — this interaction is untested.
Consider adding a mirror test that feeds a chunk with prompt_cache_hit_tokens through convertOpenAIChunkToGemini and asserts cachedContentTokenCount is populated.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Added in cf0a82c. The streaming convertOpenAIChunkToGemini path now has coverage for provider-specific prompt_cache_hit_tokens in chunk usage metadata.
| function getCachedPromptTokens(usage: OpenAI.CompletionUsage): number { | ||
| const extendedUsage = usage as ExtendedCompletionUsage; | ||
| return ( | ||
| usage.prompt_tokens_details?.cached_tokens ?? |
There was a problem hiding this comment.
[Suggestion] No test verifies the ?? precedence chain: prompt_tokens_details.cached_tokens > cached_tokens > prompt_cache_hit_tokens. This PR adds a third fallback branch, making the relative ordering meaningful.
No existing test sends multiple cache-token fields simultaneously. If a future refactor reorders the chain, the wrong value could silently take precedence without any test catching it.
Consider adding a test with usage: { prompt_tokens_details: { cached_tokens: 100 }, cached_tokens: 200, prompt_cache_hit_tokens: 300 } asserting cachedContentTokenCount === 100.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Added in cf0a82c. The non-streaming cache test now covers the precedence chain with prompt_tokens_details.cached_tokens, top-level cached_tokens, and prompt_cache_hit_tokens, asserting the standard nested field wins.
| // serialized prompt prefix even when the tool set and schemas are identical. | ||
| // Sort only for official DeepSeek endpoints to avoid surprising other | ||
| // OpenAI-compatible providers with changed tool presentation order. | ||
| if (isDeepSeekHostname(this.contentGeneratorConfig)) { |
There was a problem hiding this comment.
[Suggestion] The tool sort is gated on isDeepSeekHostname() but KV-cache prefix-exactness is a property of the DeepSeek model architecture, not the endpoint hostname. Self-hosted DeepSeek deployments (sglang/vllm/ollama behind a proxy) share the same cache behavior but won't benefit from the sort.
The codebase already has isDeepSeekProvider() — which checks hostname or model name — for exactly this reason (see provider/deepseek.ts:58-64). The thinking: { type: 'disabled' } gate at line 405 is correctly hostname-scoped (wire-shape extension that self-hosted models may not accept), but the sort is a purely cache-optimization that is harmless for non-DeepSeek providers.
Consider using isDeepSeekProvider() here, or add a comment explaining why self-hosted DeepSeek is intentionally excluded from the cache optimization.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 5556284. Tool sorting now uses isDeepSeekProvider() so self-hosted DeepSeek models get the cache-stable ordering by model name, while the hostname-only thinking wire-shape gate is unchanged. Added a self-hosted DeepSeek regression test.
| // serialized prompt prefix even when the tool set and schemas are identical. | ||
| // Sort only for official DeepSeek endpoints to avoid surprising other | ||
| // OpenAI-compatible providers with changed tool presentation order. | ||
| if (isDeepSeekProvider(this.contentGeneratorConfig)) { |
There was a problem hiding this comment.
[Suggestion] Stale comment: the two lines above say "Sort only for official DeepSeek endpoints" but the gate here is isDeepSeekProvider, which also matches self-hosted DeepSeek deployments via model-name fallback (as confirmed by the sorts self-hosted DeepSeek tools by model name test). The comment was accurate when the gate was isDeepSeekHostname but was not updated when the gate was widened in this commit.
Suggested reword:
// DeepSeek's KV cache is prefix-exact: a different tool order changes the
// serialized prompt prefix even when the tool set and schemas are identical.
// Gate on isDeepSeekProvider so both official and self-hosted DeepSeek
// deployments benefit; non-DeepSeek providers keep caller order.— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 349f7f2. Updated the comment to match the widened isDeepSeekProvider() gate: official and self-hosted DeepSeek deployments get cache-stable ordering, while non-DeepSeek providers keep caller order.
…-prefix-stability-fix # Conflicts: # packages/core/src/core/openaiContentGenerator/pipeline.ts
| ).toEqual(['zeta', 'alpha']); | ||
| }); | ||
|
|
||
| it('sorts self-hosted DeepSeek tools by model name before sending the wire request', async () => { |
There was a problem hiding this comment.
[Suggestion] Test title says "by model name" but the sort key is tool.function.name. The model name is the detection gate (via isDeepSeekProvider), not the sort criterion. The sibling test at line 427 is correctly titled "sorts DeepSeek tools by function name". Consider aligning: e.g., "sorts tools by function name for self-hosted DeepSeek (detected via model name)".
— qwen3.7-max via Qwen Code /review
| (tool: OpenAI.Chat.ChatCompletionTool) => tool.function.name, | ||
| ), | ||
| ).toEqual(['alpha', 'bravo', 'zeta']); | ||
| expect(apiCall.thinking).toBeUndefined(); |
There was a problem hiding this comment.
[Suggestion] expect(apiCall.thinking).toBeUndefined() asserts a hostname-gate invariant (self-hosted DeepSeek does not receive the thinking wire-shape parameter) that is unrelated to tool ordering. This is a valuable invariant but mixing it into a tool-sort test makes failures harder to diagnose and hides it from anyone looking for thinking-parameter tests. Consider extracting it into its own test (e.g., "does not inject thinking parameter for self-hosted DeepSeek detected by model name") alongside the existing enable_thinking tests.
— qwen3.7-max via Qwen Code /review
| return 0; | ||
| } | ||
|
|
||
| function getToolSortKey(tool: OpenAI.Chat.ChatCompletionTool): string { |
There was a problem hiding this comment.
[Suggestion] getToolSortKey builds a composite key [name, type, JSON.stringify(tool)] but tool.function.name alone is a sufficient total order — function names are unique within a tool set (the Gemini→OpenAI converter enforces this upstream). The JSON.stringify tiebreaker is invoked O(n log n) times per request and serializes the entire tool schema (often the largest part of the request body). Simplifying to return tool.function.name; is correct, cheaper, and clearer about the contract.
| function getToolSortKey(tool: OpenAI.Chat.ChatCompletionTool): string { | |
| function getToolSortKey(tool: OpenAI.Chat.ChatCompletionTool): string { | |
| return tool.function.name; | |
| } |
— qwen3.7-max via Qwen Code /review
|
|
||
| // DeepSeek's KV cache is prefix-exact: a different tool order changes the | ||
| // serialized prompt prefix even when the tool set and schemas are identical. | ||
| // Gate on isDeepSeekProvider so both official and self-hosted DeepSeek |
There was a problem hiding this comment.
[Suggestion] The comment explains what the gate does but not why the broad isDeepSeekProvider (hostname OR model name) is correct here while reasoning_effort in the same file uses the narrow isDeepSeekHostname. The rationale is that cache-prefix stability depends on the model's KV-cache behavior (not the server's wire-shape extensions), so a false positive here is harmless — reordering tools is a no-op for backends without prefix-exact caching. A future maintainer may mechanically "unify" the two gates to the narrow one and silently regress self-hosted cache hit rates. Consider adding one line: e.g., "A false positive is low-cost: reordering tools is a no-op for backends without prefix-exact caching, unlike reasoning_effort where a false positive pushes an unsupported body parameter."
— qwen3.7-max via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
R6: All R5 suggestions addressed in f751874. Code is clean, well-tested, and correctly scoped. LGTM ✅ — qwen3.7-max via Qwen Code /review
Maintainer test reportTested PR head Static checks
Live DeepSeek API — premise: cache is prefix-exact wrt tool orderDirect calls to
Cache hits collapse from 3712 → 896 when tool order changes, and are fully restored to 3712 after applying the PR's sort. The 896-token residue in phase 3 is the system-prompt prefix that survives before the tool array diverges — confirms cache is prefix-exact and not whole-prompt-hashed, exactly as the PR description claims. Live DeepSeek API — end-to-end through the actual built pipelineImported Wire body in both calls ( Two things validated together:
Notes for reviewers
VerdictBehaves as described on Linux, end to end. LGTM from a maintainer-validation standpoint. |
BZ-D
left a comment
There was a problem hiding this comment.
LGTM. Sorting tools by function name for DeepSeek's prefix-exact KV cache is the right approach. isDeepSeekProvider (broad detection) is correctly used here instead of the narrower isDeepSeekHostname — cache-prefix stability applies to self-hosted DeepSeek too. The getCachedPromptTokens refactor with prompt_cache_hit_tokens fallback has clean precedence: standard > top-level > DeepSeek-specific.
|
This PR has merge conflicts with main — please rebase or merge main to resolve before merging. |
|
Thanks for the PR and the thorough write-up, @Jerry2003826. I'd like to close this pending a justifying issue, because we couldn't reproduce the problem the main change targets. Tool-ordering change: the wire The one case where an identical set can appear in a different order is across separate process starts, where concurrent MCP discovery registers tools in completion order. That's real but narrow: multi-MCP-server setups only, only the static prefix before the first user turn, and only across sessions — not the within-session cache where most of the savings are. Could you file an issue first with a concrete repro — ideally before/after |
What this PR does
Stabilizes DeepSeek tool-cache prefixes by keeping tool declarations in a deterministic order across turns.
Why it's needed
DeepSeek prompt caching is prefix-sensitive; tool-order drift can destroy cache hits even when the effective tool set is unchanged.
Reviewer Test Plan
How to verify
Run the targeted DeepSeek cache/tool-order tests added in this PR, then run eslint/prettier/typecheck for the touched core files. GitHub CI also validates lint and tests on macOS, Windows, and Linux.
Evidence (Before & After)
Before: logically identical turns could produce different tool declaration ordering and miss prefix cache reuse. After: regression coverage verifies stable ordering. This is provider prompt/caching behavior, so TUI screenshots are N/A.
Tested on
Environment (optional)
Local Windows/PowerShell checkout with repository npm workspaces. No tmux/TUI capture is included for PRs whose behavior is core, session, parser, or transport logic rather than a visible TUI state.
Risk & Scope
Linked Issues
References the DeepSeek cache-prefix stability issue discussed in this PR.
中文说明
这个 PR 做了什么
通过稳定工具声明顺序,改善 DeepSeek 前缀缓存稳定性。
为什么需要
DeepSeek 缓存对 prompt 前缀非常敏感;工具顺序漂移会让本应命中的缓存失效。
Reviewer Test Plan
本地按 PR 相关测试、eslint、prettier check、typecheck 验证;macOS/Linux 由 GitHub CI 验证。
风险和范围
只影响工具声明的确定性排序,不改变工具可用性或执行行为。