fix(core): correct context-usage Footer for prompt size and Anthropic caches#4109
Conversation
…counting Anthropic reports the prompt across three mutually-exclusive fields — input_tokens, cache_read_input_tokens, cache_creation_input_tokens — but the adapter only summed input + cache_read, dropping the cache_creation bucket. On a fresh session that wrote the system prompt to cache, the reported promptTokenCount was off by the cache-creation amount. Extract the normalization into a shared helper used by both streaming and non-streaming paths, and add a guard for non-conforming providers that expose the Anthropic protocol but follow OpenAI-style accounting (input_tokens already covers the cache fields). When input_tokens is at least as large as both cache fields and at least one cache field is non-zero, trust input_tokens alone so we don't double-count.
…isplay The Footer's "context used" indicator is meant to track prompt size — how much of the context window the next request will carry. The current code preferred totalTokenCount (= prompt + output), so output tokens generated in the in-flight round were double-counted. Across turns this caused the % bar to oscillate non-monotonically: it could *decrease* between turns whenever the prior round's output was large. Flip the preference at every consumer site that drives the live counter: the per-stream-chunk update in the main chat, the per-round update in the subagent runtime (which drives auto-compaction), the session-resume walk, and the in-process agent panel's listener. Producer sites that expose total for billing/export are left unchanged.
E2E test reportSetup: macOS, tmux interactive mode, Procedure: send three prompts in sequence, capture the Footer's "N% context used" reading after each completes.
Pass criteria: across the three turns the Footer must be monotonically non-decreasing and reflect the real prompt size — never shrink, never under-report by an order of magnitude. Result: ✅ pass. 2.8% → 3.0% → 3.0%. How to reproduce the regression on |
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
… normalization The previous guard fell back to "input alone" whenever input_tokens was at least as large as both cache fields. In a real Anthropic conversation input_tokens grows past cache_creation_input_tokens as history accumulates, so the guard inevitably mis-classified every later turn as OpenAI-style and silently dropped the cache_creation portion from the displayed prompt size. The Footer would show a one-shot drop at the crossover point and then keep under-reporting by ~32k tokens. cache_creation_input_tokens is unique to Anthropic's protocol (OpenAI has no equivalent), so its presence is a strong signal the response follows real Anthropic semantics. Use that as the primary discriminator and only fall back to "input alone" when cache_creation is zero, cache reads are reported, and input already covers them — the actual OpenAI- on-Anthropic case the guard was meant to catch. Adds a regression test that locks in the crossover scenario.
…play # Conflicts: # packages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.ts
4b2d3ad to
34ac7bc
Compare
|
Heads up that this also changes what the auto-compression threshold reads — worth calling out because at first glance it looks like a side effect of the Footer fix, but it's an improvement on its own. Previously the threshold compared The case where the preserved window alone pushes the next send past the model's hard ceiling is handled by the reactive compression path on an actual context-length error from the API. |
|
A note on the heuristic in Some Anthropic-compatible providers (Bedrock proxies, OpenRouter, etc.) ride the Anthropic protocol but report tokens in OpenAI style — That fallback can mis-fire on a real Anthropic turn whose user message is |
wenshao
left a comment
There was a problem hiding this comment.
Additional test gaps (not mapped to specific lines):
anthropicContentGenerator.test.ts— Streaming tests don't includecache_creation_input_tokensin event payloads. The newcacheCreationTokensaccumulation chain (message_start, message_delta, usage.ts) is untested at the integration level.converter.test.ts—convertAnthropicResponseToGeminitests only use{input_tokens: 3, output_tokens: 5}without cache fields. The forwarding ofcache_read_input_tokensandcache_creation_input_tokensthrough the converter is untested.usage.test.ts— Missing test for simultaneouscacheReadTokens > 0 && cacheCreationTokens > 0scenario (real Anthropic where a turn both reads existing cache and writes new cache).
— DeepSeek/deepseek-v4-pro via Qwen Code /review
…he-field plumbing
- Restore the `isFinite` guard on `lastPromptTokenCount`: the previous
`if (contextTok)` relaxation accepted `Infinity` (truthy), which a
malformed provider response could otherwise latch and poison the
downstream compaction math.
- Add unit coverage for the cache-field plumbing the PR introduced:
- usage.ts: real-Anthropic warm-turn case where `cache_read > 0` and
`cache_creation > 0` simultaneously (mid-conversation breakpoint
advance over an already-cached prefix).
- converter.ts: `convertAnthropicResponseToGemini` now exercised with
all three prompt buckets present to confirm both cache fields are
forwarded to `usageMetadata`.
- anthropicContentGenerator.ts: streaming pipeline test that includes
`cache_creation_input_tokens` in `message_start` and asserts the
accumulated `usageMetadata` carries it through to the final chunk.
|
Addressed the review in commit eb4a33e: Inline comments — replied + resolved on both threads. The heuristic concern and the prompt-only-vs-total compaction-seed concern were already addressed by earlier thread comments (see here and here). isFinite guard — restored. The relaxation to Test gaps — three targeted unit tests added:
All four affected suites pass (130 tests). Typecheck clean. |
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — DeepSeek/deepseek-v4-pro via Qwen Code /review
… caches (QwenLM#4109) * fix(core): include cache_creation_input_tokens in Anthropic prompt accounting Anthropic reports the prompt across three mutually-exclusive fields — input_tokens, cache_read_input_tokens, cache_creation_input_tokens — but the adapter only summed input + cache_read, dropping the cache_creation bucket. On a fresh session that wrote the system prompt to cache, the reported promptTokenCount was off by the cache-creation amount. Extract the normalization into a shared helper used by both streaming and non-streaming paths, and add a guard for non-conforming providers that expose the Anthropic protocol but follow OpenAI-style accounting (input_tokens already covers the cache fields). When input_tokens is at least as large as both cache fields and at least one cache field is non-zero, trust input_tokens alone so we don't double-count. * fix(core): prefer promptTokenCount over totalTokenCount for context display The Footer's "context used" indicator is meant to track prompt size — how much of the context window the next request will carry. The current code preferred totalTokenCount (= prompt + output), so output tokens generated in the in-flight round were double-counted. Across turns this caused the % bar to oscillate non-monotonically: it could *decrease* between turns whenever the prior round's output was large. Flip the preference at every consumer site that drives the live counter: the per-stream-chunk update in the main chat, the per-round update in the subagent runtime (which drives auto-compaction), the session-resume walk, and the in-process agent panel's listener. Producer sites that expose total for billing/export are left unchanged. * fix(core): use cache_creation as the discriminator in Anthropic usage normalization The previous guard fell back to "input alone" whenever input_tokens was at least as large as both cache fields. In a real Anthropic conversation input_tokens grows past cache_creation_input_tokens as history accumulates, so the guard inevitably mis-classified every later turn as OpenAI-style and silently dropped the cache_creation portion from the displayed prompt size. The Footer would show a one-shot drop at the crossover point and then keep under-reporting by ~32k tokens. cache_creation_input_tokens is unique to Anthropic's protocol (OpenAI has no equivalent), so its presence is a strong signal the response follows real Anthropic semantics. Use that as the primary discriminator and only fall back to "input alone" when cache_creation is zero, cache reads are reported, and input already covers them — the actual OpenAI- on-Anthropic case the guard was meant to catch. Adds a regression test that locks in the crossover scenario. * chore(core): address PR review — restore isFinite guard and cover cache-field plumbing - Restore the `isFinite` guard on `lastPromptTokenCount`: the previous `if (contextTok)` relaxation accepted `Infinity` (truthy), which a malformed provider response could otherwise latch and poison the downstream compaction math. - Add unit coverage for the cache-field plumbing the PR introduced: - usage.ts: real-Anthropic warm-turn case where `cache_read > 0` and `cache_creation > 0` simultaneously (mid-conversation breakpoint advance over an already-cached prefix). - converter.ts: `convertAnthropicResponseToGemini` now exercised with all three prompt buckets present to confirm both cache fields are forwarded to `usageMetadata`. - anthropicContentGenerator.ts: streaming pipeline test that includes `cache_creation_input_tokens` in `message_start` and asserts the accumulated `usageMetadata` carries it through to the final chunk. (cherry picked from commit 85c10c1)
Summary
promptTokenCountinstead oftotalTokenCount— a more accurate measure of context-window usage.Validation
Scope / Risk
tool_result-only with a tool output large enough to pushinput_tokenspastcache_read_input_tokens(e.g. a 50k+grepresult). The under-report is one-shot and self-correcting — the next turn that adds a fresh cache breakpoint restores the full reading. It doesn't compound across turns.claude-opus-4-7locally. Other Anthropic-compatible providers (Bedrock, DeepSeek's Anthropic mode, third-party proxies) were not exercised end-to-end, though their semantics are what the guard is designed for.Testing Matrix
Testing matrix notes:
Linked Issues / Bugs
Related to #4025 — the reporter mentions "Qoder CLI" so the auto-close keyword is intentionally omitted; the underlying inaccuracy they describe matches the bug fixed here in qwen-code.