-
Notifications
You must be signed in to change notification settings - Fork 2.5k
fix(core): stabilize DeepSeek tool cache prefix #4518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
ee823dc
cf0a82c
5556284
349f7f2
58f1d2c
f751874
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -36,6 +36,7 @@ const debugLogger = createDebugLogger('CONVERTER'); | |
| */ | ||
| interface ExtendedCompletionUsage extends OpenAI.CompletionUsage { | ||
| cached_tokens?: number; | ||
| prompt_cache_hit_tokens?: number; | ||
| } | ||
|
|
||
| export interface ExtendedChatCompletionAssistantMessageParam | ||
|
|
@@ -69,6 +70,16 @@ export interface ExtendedCompletionChunkDelta | |
| // so it preserves catch-rate without silently suppressing legitimate chunks. | ||
| const CUMULATIVE_DELTA_EXACT_REPEAT_MIN_LENGTH = 64; | ||
|
|
||
| function getCachedPromptTokens(usage: OpenAI.CompletionUsage): number { | ||
| const extendedUsage = usage as ExtendedCompletionUsage; | ||
| return ( | ||
| usage.prompt_tokens_details?.cached_tokens ?? | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [Suggestion] No test verifies the No existing test sends multiple cache-token fields simultaneously. If a future refactor reorders the chain, the wrong value could silently take precedence without any test catching it. Consider adding a test with — qwen3.7-max via Qwen Code /review
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added in cf0a82c. The non-streaming cache test now covers the precedence chain with |
||
| extendedUsage.cached_tokens ?? | ||
| extendedUsage.prompt_cache_hit_tokens ?? | ||
| 0 | ||
| ); | ||
| } | ||
|
|
||
| // Once this many bytes have been emitted without entering cumulative mode the | ||
| // stream is almost certainly a standard incremental provider. Stop growing | ||
| // emittedText beyond this point to bound per-stream memory and CPU. The true | ||
|
|
@@ -1121,13 +1132,8 @@ export function convertOpenAIResponseToGemini( | |
| const promptTokens = usage.prompt_tokens || 0; | ||
| const completionTokens = usage.completion_tokens || 0; | ||
| const totalTokens = usage.total_tokens || 0; | ||
| // Support both formats: prompt_tokens_details.cached_tokens (OpenAI standard) | ||
| // and cached_tokens (some models return it at top level) | ||
| const extendedUsage = usage as ExtendedCompletionUsage; | ||
| const cachedTokens = | ||
| usage.prompt_tokens_details?.cached_tokens ?? | ||
| extendedUsage.cached_tokens ?? | ||
| 0; | ||
| // Support OpenAI and provider-specific cache usage fields. | ||
| const cachedTokens = getCachedPromptTokens(usage); | ||
| const thinkingTokens = | ||
| usage.completion_tokens_details?.reasoning_tokens || 0; | ||
|
|
||
|
|
@@ -1320,13 +1326,8 @@ export function convertOpenAIChunkToGemini( | |
| const totalTokens = usage.total_tokens || 0; | ||
| const thinkingTokens = | ||
| usage.completion_tokens_details?.reasoning_tokens || 0; | ||
| // Support both formats: prompt_tokens_details.cached_tokens (OpenAI standard) | ||
| // and cached_tokens (some models return it at top level) | ||
| const extendedUsage = usage as ExtendedCompletionUsage; | ||
| const cachedTokens = | ||
| usage.prompt_tokens_details?.cached_tokens ?? | ||
| extendedUsage.cached_tokens ?? | ||
| 0; | ||
| // Support OpenAI and provider-specific cache usage fields. | ||
| const cachedTokens = getCachedPromptTokens(usage); | ||
|
|
||
| // If we only have total tokens but no breakdown, estimate the split | ||
| // Typically input is ~70% and output is ~30% for most conversations | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -11,14 +11,34 @@ import { | |||||||||
| } from '@google/genai'; | ||||||||||
| import type { ContentGeneratorConfig } from '../contentGenerator.js'; | ||||||||||
| import { OpenAIContentConverter } from './converter.js'; | ||||||||||
| import { isDeepSeekHostname } from './provider/deepseek.js'; | ||||||||||
| import { isDeepSeekHostname, isDeepSeekProvider } from './provider/deepseek.js'; | ||||||||||
| import { openaiRequestCaptureContext } from './requestCaptureContext.js'; | ||||||||||
| import { StreamingToolCallParser } from './streamingToolCallParser.js'; | ||||||||||
| import { TaggedThinkingParser } from './taggedThinkingParser.js'; | ||||||||||
| import type { PipelineConfig, RequestContext } from './types.js'; | ||||||||||
| import { redactProxyError } from '../../utils/runtimeFetchOptions.js'; | ||||||||||
| import { runtimeDiagnostics } from '../../utils/runtimeDiagnostics.js'; | ||||||||||
|
|
||||||||||
| function compareStableStrings(left: string, right: string): number { | ||||||||||
| if (left < right) return -1; | ||||||||||
| if (left > right) return 1; | ||||||||||
| return 0; | ||||||||||
| } | ||||||||||
|
|
||||||||||
| function getToolSortKey(tool: OpenAI.Chat.ChatCompletionTool): string { | ||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [Suggestion]
Suggested change
— qwen3.7-max via Qwen Code /review |
||||||||||
| return tool.function.name; | ||||||||||
| } | ||||||||||
|
|
||||||||||
| function sortToolsForCacheStableRequest( | ||||||||||
| request: OpenAI.Chat.ChatCompletionCreateParams, | ||||||||||
| ): void { | ||||||||||
| if (!request.tools || request.tools.length < 2) return; | ||||||||||
|
|
||||||||||
| request.tools = [...request.tools].sort((left, right) => | ||||||||||
| compareStableStrings(getToolSortKey(left), getToolSortKey(right)), | ||||||||||
| ); | ||||||||||
| } | ||||||||||
|
|
||||||||||
| /** | ||||||||||
| * Error thrown when the API returns an error embedded as stream content | ||||||||||
| * instead of a proper HTTP error. Some providers (e.g., certain OpenAI-compatible | ||||||||||
|
|
@@ -367,6 +387,16 @@ export class ContentGenerationPipeline { | |||||||||
| } | ||||||||||
| } | ||||||||||
|
|
||||||||||
| // DeepSeek's KV cache is prefix-exact: a different tool order changes the | ||||||||||
| // serialized prompt prefix even when the tool set and schemas are identical. | ||||||||||
| // Gate on broad provider detection because cache-prefix stability follows | ||||||||||
| // the DeepSeek model/protocol even for self-hosted deployments. The | ||||||||||
| // narrower hostname gate above is only for DeepSeek's official V4 | ||||||||||
| // `thinking` wire shape, which self-hosted servers may reject. | ||||||||||
| if (isDeepSeekProvider(this.contentGeneratorConfig)) { | ||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [Suggestion] Stale comment: the two lines above say "Sort only for official DeepSeek endpoints" but the gate here is Suggested reword: // DeepSeek's KV cache is prefix-exact: a different tool order changes the
// serialized prompt prefix even when the tool set and schemas are identical.
// Gate on isDeepSeekProvider so both official and self-hosted DeepSeek
// deployments benefit; non-DeepSeek providers keep caller order.— qwen3.7-max via Qwen Code /review
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed in 349f7f2. Updated the comment to match the widened |
||||||||||
| sortToolsForCacheStableRequest(providerRequest); | ||||||||||
| } | ||||||||||
|
|
||||||||||
| return providerRequest; | ||||||||||
| } | ||||||||||
|
|
||||||||||
|
|
||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Suggestion] This test covers the non-streaming path (
convertOpenAIResponseToGemini), but the streaming path (convertOpenAIChunkToGemini) also callsgetCachedPromptTokensand has no corresponding test forprompt_cache_hit_tokens.DeepSeek's streaming API may include
prompt_cache_hit_tokensin the final chunk'susageobject. The streaming usage assembly has its own estimation fallback logic (totalTokens > 0 && promptTokens === 0 && completionTokens === 0) that interacts withcachedTokens— this interaction is untested.Consider adding a mirror test that feeds a chunk with
prompt_cache_hit_tokensthroughconvertOpenAIChunkToGeminiand assertscachedContentTokenCountis populated.— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in cf0a82c. The streaming
convertOpenAIChunkToGeminipath now has coverage for provider-specificprompt_cache_hit_tokensin chunk usage metadata.