fix(core): include response tokens in prompt estimate#4525
fix(core): include response tokens in prompt estimate#4525Jerry2003826 wants to merge 11 commits into
Conversation
|
Updated the PR description to the latest template with Reviewer Test Plan, Evidence, Tested on, Risk & Scope, and Chinese summary. No code changes were needed for this update; CI is already green across macOS, Windows, Linux, lint, and CodeQL. |
| // Context usage tracks prompt size; output isn't in history yet. | ||
| const lastPromptTokenCount = | ||
| usageMetadata.promptTokenCount || usageMetadata.totalTokenCount; | ||
| this.lastCandidatesTokenCount = |
There was a problem hiding this comment.
[Suggestion] Two issues at this assignment site:
1. Double-counting via totalTokenCount fallback. When promptTokenCount is absent (line 2538), lastPromptTokenCount falls back to totalTokenCount, which already includes candidatesTokenCount per the Gemini API. Adding candidatesTokenCount again here means the next estimatePromptTokens call computes (prompt + candidates) + candidates + userEst — over-counting by exactly candidatesTokenCount. This path is reachable via OpenAI-compatible providers that only report total_tokens and completion_tokens.
Consider guarding so lastCandidatesTokenCount is only set when lastPromptTokenCount came from the authoritative promptTokenCount:
if (usageMetadata.promptTokenCount) {
this.lastCandidatesTokenCount = usageMetadata.candidatesTokenCount ?? 0;
}2. Missing thoughtsTokenCount. For thinking models (Gemini 2.5 Pro, etc.), thinking tokens are reported in a separate thoughtsTokenCount field that is NOT included in candidatesTokenCount. Thought parts are appended to history (line ~2686) and count toward the next promptTokenCount, but this field only captures visible output tokens. For a response with 15K thinking tokens and 500 visible output tokens, the estimate under-counts by 15K.
Consider including thinking tokens:
this.lastCandidatesTokenCount =
(usageMetadata.candidatesTokenCount ?? 0) +
(usageMetadata.thoughtsTokenCount ?? 0);(Consider renaming the field to lastOutputTokenCount to match the broader semantics.)
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 846ccf4. The usage metadata path now only carries previous-response output tokens when promptTokenCount is present, so the totalTokenCount fallback is not double-counted. It also adds thoughtsTokenCount to the carried output token count for thinking-model responses.
Added regressions for both cases: one where totalTokenCount + candidatesTokenCount must stay below the hard-tier force threshold, and one where promptTokenCount + candidatesTokenCount + thoughtsTokenCount must cross it.
| */ | ||
| setLastPromptTokenCount(count: number): void { | ||
| this.lastPromptTokenCount = count; | ||
| this.lastCandidatesTokenCount = 0; |
There was a problem hiding this comment.
[Suggestion] This line has a hidden side-effect: it zeroes lastCandidatesTokenCount. The JSDoc on setLastPromptTokenCount (lines 1320-1326) only mentions seeding prompt-token counts and doesn't document this reset. A future caller who uses this method to update the prompt count mid-session (not just at seeding time) will silently discard the candidates count from the last response.
Consider updating the JSDoc to mention the side-effect:
* Side-effect: also resets `lastCandidatesTokenCount` to 0, because
* externally-seeded prompt counts come from a different response than
* the one that populated `lastCandidatesTokenCount`.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Documented in 846ccf4. The JSDoc now explicitly says setLastPromptTokenCount() clears the remembered previous-response output token count because externally seeded prompt counts come from a different chat instance and should not inherit this chat's last response size.
1267375 to
846ccf4
Compare
| usageMetadata.promptTokenCount ?? usageMetadata.totalTokenCount; | ||
| this.lastCandidatesTokenCount = | ||
| usageMetadata.promptTokenCount !== undefined | ||
| ? (usageMetadata.candidatesTokenCount ?? 0) + |
There was a problem hiding this comment.
[Suggestion] For OpenAI-compatible providers, candidatesTokenCount already includes reasoning_tokens (the OpenAI converter at openaiContentGenerator/converter.ts sets candidatesTokenCount = completion_tokens, which includes reasoning_tokens, and separately sets thoughtsTokenCount = reasoning_tokens). Adding them here gives completion_tokens + reasoning_tokens — double-counting the reasoning portion.
Native Gemini is unaffected (candidatesTokenCount and thoughtsTokenCount are disjoint). But for OpenAI thinking models (e.g., o1/o3), this inflates lastCandidatesTokenCount by the size of reasoning_tokens (potentially 1K–10K per turn), making the hard-tier estimate more conservative than necessary.
Two possible fixes:
Option A — derive from totalTokenCount (provider-agnostic):
this.lastCandidatesTokenCount =
usageMetadata.promptTokenCount !== undefined
? Math.max(0, (usageMetadata.totalTokenCount ?? 0) - usageMetadata.promptTokenCount)
: 0;Option B — fix the OpenAI converter to subtract reasoning from candidates:
candidatesTokenCount: Math.max(0, completionTokens - thinkingTokens),Option A keeps the fix local; Option B aligns the converter with Gemini semantics (which may benefit other consumers too).
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Handled in 85668cd. The carried previous-response output token count now prefers totalTokenCount - promptTokenCount when totalTokenCount is available, so OpenAI-compatible completion_tokens that already include reasoning_tokens are not counted again via thoughtsTokenCount. Added a regression covering this exact OpenAI-compatible reasoning-token shape.
|
This PR shares code with PR #4526, PR #4528, and PR #4531 — all modifying core token estimation and prompt handling logic. Key shared functions:
Recommendation: All 4 PRs modify the same token estimation and compression pipeline. Review as a batch to ensure consistency in token counting logic. This conflict was detected automatically by CodeScope skills — structural analysis of call graphs and file modifications. |
|
This PR shares code with PR #4526, PR #4528, and PR #4531 — all modifying core token estimation and prompt handling logic. Key shared functions:
Recommendation: All 4 PRs modify the same token estimation and compression pipeline. Review as a batch to ensure consistency in token counting logic. This conflict was detected automatically by CodeScope skills — structural analysis of call graphs and file modifications. |
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] lastCandidatesTokenCount is not persisted/restored on session resume (--continue). sessionService.getResumePromptTokenCount extracts only promptTokenCount, and setLastPromptTokenCount resets lastCandidatesTokenCount = 0. After resume, the first send's estimate is lower than it should be by the last response's output tokens. Consider extracting candidatesTokenCount + thoughtsTokenCount from the saved usageMetadata and threading it through a new setLastCandidatesTokenCount method.
— qwen3.7-max via Qwen Code /review
| const lastPromptTokenCount = | ||
| usageMetadata.promptTokenCount || usageMetadata.totalTokenCount; | ||
| usageMetadata.promptTokenCount ?? usageMetadata.totalTokenCount; | ||
| this.lastCandidatesTokenCount = |
There was a problem hiding this comment.
[Suggestion] lastCandidatesTokenCount uses promptTokenCount !== undefined as its guard, but lastPromptTokenCount (line 2547) uses a truthy check on the ?? result. When promptTokenCount === 0, lastCandidatesTokenCount IS updated but lastPromptTokenCount is NOT — the two counters become desynchronized, and the next estimatePromptTokens mixes a stale prompt count with a fresh candidate count.
Consider moving lastCandidatesTokenCount inside the if (lastPromptTokenCount) block so both counters update atomically:
if (lastPromptTokenCount) {
this.lastPromptTokenCount = lastPromptTokenCount;
this.lastCandidatesTokenCount =
(usageMetadata.candidatesTokenCount ?? 0) +
(usageMetadata.thoughtsTokenCount ?? 0);
this.telemetryService?.setLastPromptTokenCount(lastPromptTokenCount);
}Also: the field name lastCandidatesTokenCount is slightly misleading — it stores candidatesTokenCount + thoughtsTokenCount (all output tokens), not just candidates. Renaming to lastOutputTokenCount would prevent future maintainainers from accidentally double-counting thoughts. A debugLogger.debug at the assignment site would also help with compaction debugging.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 75afe0c. The previous-output counter now updates only inside the same if (lastPromptTokenCount) block as lastPromptTokenCount, so a zero prompt count cannot refresh one counter while leaving the other stale.
| chat.getHistoryShallow(true), | ||
| pendingUserMessage, | ||
| originalTokenCount, | ||
| 0, |
There was a problem hiding this comment.
[Suggestion] The 0 here is lastCandidatesTokenCount. Consider adding an inline comment explaining why: the compression service lacks access to the chat's per-chat output token count, and the precomputedEffectiveTokens path (which already includes candidates) covers the common sendMessageStream caller.
| 0, | |
| 0, // lastCandidatesTokenCount: unavailable here; precomputedEffectiveTokens covers the common path |
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Handled in 75afe0c. Added the inline comment explaining that this path has no per-chat output-token state and that the common sendMessageStream caller uses precomputedEffectiveTokens, which already includes previous output tokens.
| passedOpts.pendingUserMessage?.parts?.some( | ||
| (part) => part.text === userMessage, | ||
| ), | ||
| ).toBe(true); |
There was a problem hiding this comment.
[Suggestion] Two test gaps:
-
Compression-success reset: the existing test "resets previous response candidate tokens when seeding last prompt tokens externally" covers the
setLastPromptTokenCountreset path (line 1332) but not the compression-success reset at line 1375. Add a test that primeslastCandidatesTokenCount, triggers successful compression, and asserts the follow-up'sprecomputedEffectiveTokensdoes NOT include pre-compression candidate tokens. -
Candidates ignored on fallback: no test verifies
estimatePromptTokens(history, user, 0, 9999)ignoreslastCandidatesTokenCountwhenlastPromptTokenCountis 0:
it('ignores lastCandidatesTokenCount when lastPromptTokenCount is 0', () => {
const fullEst = estimateContentTokens([...history, user]);
expect(estimatePromptTokens(history, user, 0, 9999)).toBe(fullEst);
});— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Added in 75afe0c. New coverage verifies successful compression clears the previous-output token contribution before the next estimate, and estimatePromptTokens(..., 0, lastCandidatesTokenCount) ignores candidate tokens on the first-send fallback path.
|
Handled in 1155387. Resume now restores both sides of the prompt estimate:
Validation:
|
wenshao
left a comment
There was a problem hiding this comment.
R4 review at 1155387. All R3 inline comments (guard mismatch, undocumented 0, compression-reset test gap) have been addressed — nice work. CI green (8/8), 437 tests pass. Two new Suggestions below: shared output-token formula to prevent drift, and a stream-path test for the candidates + thoughts fallback branch.
— qwen3.7-max via Qwen Code /review
| usageMetadata.promptTokenCount !== undefined | ||
| ? usageMetadata.totalTokenCount !== undefined | ||
| ? Math.max( | ||
| 0, |
There was a problem hiding this comment.
[Suggestion] The nested ternary here duplicates the same three-branch output-token derivation as getUsageOutputTokenCountForPromptEstimate() in sessionService.ts (added in this same PR). Both implement: (1) no promptTokenCount → 0, (2) totalTokenCount present → total - prompt, (3) fallback → candidates + thoughts. When a new provider reports tokens differently, a maintainer may update one copy and miss the other, causing resume vs. steady-state to produce different lastCandidatesTokenCount values.
| 0, | |
| this.lastCandidatesTokenCount = | |
| getUsageOutputTokenCountForPromptEstimate(usageMetadata); |
(Export getUsageOutputTokenCountForPromptEstimate from sessionService.ts or move it to tokenEstimation.ts so both call sites share one implementation.)
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Handled in 9a60577. Moved the output-token derivation into shared getUsageOutputTokenCountForPromptEstimate() in tokenEstimation.ts; both steady-state stream handling and resume restoration now use the same helper.
|
|
||
| it('resets previous response candidate tokens when seeding last prompt tokens externally', async () => { | ||
| const compressSpy = vi | ||
| .spyOn(ChatCompressionService.prototype, 'compress') |
There was a problem hiding this comment.
[Suggestion] The stream handler's candidatesTokenCount + thoughtsTokenCount fallback branch (geminiChat.ts:2564-2565 — fires when promptTokenCount is defined but totalTokenCount is undefined) is not exercised by any stream-path test. All three existing tests that provide promptTokenCount also provide totalTokenCount. The equivalent computation in getUsageOutputTokenCountForPromptEstimate IS tested for this scenario ("should restore disjoint candidate and thought output tokens when total is unavailable"), but the stream handler's duplicate is not.
Add a stream-path test modeled on "includes previous response thought tokens" but with totalTokenCount omitted from mock usageMetadata:
mockResolvedValueOnce(
makeStreamResponse('first', {
promptTokenCount: 176_000,
candidatesTokenCount: 1_200,
thoughtsTokenCount: 300,
// totalTokenCount intentionally omitted
}),
)Then assert that precomputedEffectiveTokens on the follow-up call reflects 176_000 + 1_500 (prompt + candidates + thoughts).
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Added in 9a60577. The stream path now has coverage for usage metadata with promptTokenCount, candidatesTokenCount, and thoughtsTokenCount while totalTokenCount is omitted, and the follow-up hard-tier estimate includes candidates + thoughts.
| setLastCandidatesTokenCount(count: number): void { | ||
| this.lastCandidatesTokenCount = Math.max(0, count); | ||
| } | ||
|
|
There was a problem hiding this comment.
[Suggestion] setLastCandidatesTokenCount uses Math.max(0, count), which does not guard against NaN — Math.max(0, NaN) returns NaN. If a future caller passes NaN (e.g., from unchecked arithmetic on usage metadata), lastCandidatesTokenCount becomes NaN, propagating through estimatePromptTokens and making effectiveTokens >= hard evaluate to false — silently disabling the hard-tier rescue gate with no observable signal.
| setLastCandidatesTokenCount(count: number): void { | |
| this.lastCandidatesTokenCount = Number.isFinite(count) ? Math.max(0, count) : 0; | |
| } |
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 11d064d. The old candidates setter was replaced by seedResumeTokenCounts(), which normalizes non-finite output counts to 0 before storing them.
| * under-counting the next request near the hard compaction threshold. | ||
| */ | ||
| private lastCandidatesTokenCount = 0; | ||
|
|
There was a problem hiding this comment.
[Suggestion] The field lastCandidatesTokenCount (and the setter/parameter names throughout) stores candidatesTokenCount + thoughtsTokenCount — the total non-prompt output — not just the Gemini API's candidatesTokenCount. A future maintainer reading the field name will assume it maps 1:1 to the API's candidatesTokenCount and may omit thoughtsTokenCount in new code paths, or double-count thoughts if added separately.
Consider renaming to lastOutputTokenCount across the field, getter/setter, and the estimatePromptTokens parameter to match the semantic (total output tokens from the previous response).
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 11d064d. The per-chat counter and estimatePromptTokens parameter are now named lastOutputTokenCount, matching the candidates+thoughts semantics instead of the raw Gemini field name.
| uiTelemetryService.getLastPromptTokenCount(), | ||
| ); | ||
| if (resumeTokenCounts) { | ||
| chat.setLastCandidatesTokenCount( |
There was a problem hiding this comment.
[Suggestion] The resume flow depends on a strict call order: setLastPromptTokenCount (which silently zeroes lastCandidatesTokenCount as a side-effect), then setLastCandidatesTokenCount (which restores the value). Nothing at the type or API level enforces this ordering — a future refactor that reorders or wraps these calls will silently zero the candidates count with no compile error or runtime error.
Consider introducing an atomic seeder method to make the coupling explicit:
| chat.setLastCandidatesTokenCount( | |
| if (resumeTokenCounts) { | |
| chat.seedResumeTokenCounts( | |
| resumeTokenCounts.promptTokenCount, | |
| resumeTokenCounts.candidatesTokenCount, | |
| ); | |
| } |
Keep setLastPromptTokenCount (with its zeroing side-effect) for subagent/post-compression paths where zeroing is correct.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 11d064d. Resume now seeds prompt/output counts through seedResumeTokenCounts(promptTokenCount, outputTokenCount), so the two restored counters are set atomically.
| usageMetadata.promptTokenCount ?? usageMetadata.totalTokenCount; | ||
| if (lastPromptTokenCount) { | ||
| // Always update the per-chat counter so this chat (including | ||
| // subagents) can make its own compaction decisions. |
There was a problem hiding this comment.
[Suggestion] The || to ?? change here is a behavioral change: when promptTokenCount === 0, the old || would fall through to totalTokenCount (entering the update block), while ?? keeps 0, causing if (lastPromptTokenCount) to be falsy and skipping both counter updates. No test exercises promptTokenCount: 0 to pin the intended behavior. Without a regression test, a future maintainer could revert to || thinking the change was accidental, reintroducing the original over-counting.
Consider adding a test with usageMetadata: { promptTokenCount: 0, totalTokenCount: 5000 } that documents the expected behavior (counters remain unchanged).
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 11d064d. Added does not replace token counters when usage reports zero prompt tokens, which pins that promptTokenCount: 0 does not reseed counters from totalTokenCount.
| ? Math.max( | ||
| 0, | ||
| usageMetadata.totalTokenCount - | ||
| usageMetadata.promptTokenCount, |
There was a problem hiding this comment.
[Suggestion] The candidatesTokenCount + thoughtsTokenCount fallback branch in getUsageOutputTokenCountForPromptEstimate (when promptTokenCount is defined but totalTokenCount is absent) is not tested through the stream handler path. Every test with promptTokenCount also provides totalTokenCount, so the ternary always takes the subtraction branch. The same function IS tested via sessionService.test.ts ("should restore disjoint candidate and thought output tokens when total is unavailable"), but the stream handler's invocation of this branch remains uncovered.
Providers that return promptTokenCount without totalTokenCount rely on this untested path for correct compaction thresholds.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Covered in 9a60577 by includes disjoint candidate and thought tokens when total token count is unavailable; the stream-path fixture omits totalTokenCount and verifies the next hard-tier estimate includes candidates + thoughts.
wenshao
left a comment
There was a problem hiding this comment.
R5 review at 9a60577 — incremental review of 2 new commits since R4. Both R4 Suggestions addressed: (1) output-token formula extracted into shared getUsageOutputTokenCountForPromptEstimate() in tokenEstimation.ts; (2) stream-path fallback branch now has test coverage. All R3 inline comments resolved (guard mismatch, undocumented zero, compression-reset test gap). Downgraded from Approve to Comment: CI still running. tsc clean, eslint clean, 234 targeted tests pass (geminiChat 155 + tokenEstimation 10 + chatCompressionService 69), 9432 full-suite tests pass. One low-confidence Suggestion (terminal-only, not posted inline): shared helper lacks direct unit tests in tokenEstimation.test.ts — integration tests cover the branches but direct tests would pin the contract. — qwen3.7-max via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Cross-file observation: agent-core.ts:365 seeds lastPromptTokenCount on subagent fork chats but does not seed lastCandidatesTokenCount. Since setLastPromptTokenCount resets lastCandidatesTokenCount = 0, fork subagents will under-count the prompt estimate by the parent's previous response output tokens — the same under-counting this PR fixes for the main chat and session resume paths. Not a blocker for this PR (agent-core.ts is out of scope), but worth tracking as follow-up.
— qwen3.7-max via Qwen Code /review
| ); | ||
| this.getChat().setLastPromptTokenCount( | ||
| uiTelemetryService.getLastPromptTokenCount(), | ||
| const resumeTokenCounts = getResumeTokenCounts( |
There was a problem hiding this comment.
[Suggestion] getResumeTokenCounts(conversation) is called twice for the same conversation object during resume: once inside replayUiTelemetryFromConversation() at line 228 (which was refactored in this PR to call getResumeTokenCounts internally), and again here. Both walk messages backward — an O(n) traversal done redundantly.
Consider having replayUiTelemetryFromConversation return the ResumeTokenCounts | undefined it already computed, and consuming the return value here instead of recomputing:
// sessionService.ts
export function replayUiTelemetryFromConversation(
conversation: ConversationRecord,
): ResumeTokenCounts | undefined {
const resumeTokenCounts = getResumeTokenCounts(conversation);
if (resumeTokenCounts !== undefined) {
uiTelemetryService.setLastPromptTokenCount(resumeTokenCounts.promptTokenCount);
}
return resumeTokenCounts;
}
// client.ts
const resumeTokenCounts = replayUiTelemetryFromConversation(
resumedSessionData.conversation,
);— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 4b27a34. replayUiTelemetryFromConversation() now returns the ResumeTokenCounts it computes, and the resume path consumes that return value instead of walking the conversation a second time.
| promptTokenCount: 200, | ||
| candidatesTokenCount: 60, | ||
| thoughtsTokenCount: 20, | ||
| totalTokenCount: 260, |
There was a problem hiding this comment.
[Suggestion] The fixture has internally inconsistent token accounting: promptTokenCount: 200 + candidatesTokenCount: 60 + thoughtsTokenCount: 20 = 280, but totalTokenCount: 260. Because getUsageOutputTokenCountForPromptEstimate takes the totalTokenCount - promptTokenCount = 60 path when totalTokenCount is defined, it silently drops thoughtsTokenCount: 20 and the assertion toHaveBeenCalledWith(60) passes without verifying that thinking tokens are included.
Fix the fixture so totalTokenCount matches the sum (280), then assert toHaveBeenCalledWith(80) (candidates 60 + thoughts 20):
| totalTokenCount: 260, | |
| totalTokenCount: 280, |
| totalTokenCount: 260, | |
| expect(setLastCandidatesTokenCountSpy).toHaveBeenCalledWith(80); |
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 4b27a34. The fixture now uses consistent accounting (promptTokenCount: 200, candidatesTokenCount: 60, thoughtsTokenCount: 20, totalTokenCount: 280) and asserts the resumed seed receives output token count 80.
| if (usage.totalTokenCount !== undefined) { | ||
| return Math.max(0, usage.totalTokenCount - usage.promptTokenCount); | ||
| } | ||
| return (usage.candidatesTokenCount ?? 0) + (usage.thoughtsTokenCount ?? 0); |
There was a problem hiding this comment.
[Suggestion] This disjoint fallback lacks a Math.max(0, ...) wrapper, unlike the sibling totalTokenCount - promptTokenCount branch on line 94. The streaming path in geminiChat.ts:2559-2560 assigns the return value directly to this.lastCandidatesTokenCount (bypassing the Math.max(0, count) guard in setLastCandidatesTokenCount()), so a provider that returns negative candidatesTokenCount or thoughtsTokenCount would propagate an unclamped value.
Wrap for consistency:
| return (usage.candidatesTokenCount ?? 0) + (usage.thoughtsTokenCount ?? 0); | |
| return Math.max(0, (usage.candidatesTokenCount ?? 0) + (usage.thoughtsTokenCount ?? 0)); |
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in 4b27a34. The disjoint candidates+thoughts fallback is clamped with Math.max(0, ...), and tokenEstimation.test.ts now covers negative disjoint output counts.
wenshao
left a comment
There was a problem hiding this comment.
R6 review at 4b27a34 — all R5 inline comments addressed (10 stale on old commits, 6 resolved with author replies on current commit). CI green (24/24), 373 tests pass (4 files). No high-confidence findings.
Low-confidence observations (for human review):
-
chatCompressionService.ts:364hardcodes0forlastOutputTokenCountbecauseGeminiChat.lastOutputTokenCountis private with no getter. The comment documents this and the commonsendMessageStreampath is safe (usesprecomputedEffectiveTokens), but a future caller ofcompress()without precomputed tokens would silently under-count. Consider adding agetLastOutputTokenCount()getter. -
tokenEstimation.test.tscovers only 1 of 3 branches ingetUsageOutputTokenCountForPromptEstimatedirectly (the negative-disjoint clamp). The undefined-usage → 0 and total-prompt branches are indirectly tested viagetResumeTokenCountsintegration tests but lack direct unit coverage. -
sessionService.test.tshas no test forgetResumeTokenCountsreturningundefined(empty conversation or no usable records). Callers guard this correctly but the contract isn't asserted.
Cross-file observation: contextCommand.ts:321 (not in this PR) has a stale TODO referencing the old estimatePromptTokens(history, undefined, 0, imageTokenEstimate) 4-arg signature. After this PR, imageTokenEstimate moved to position 5. Anyone implementing that TODO would silently misplace the argument.
— qwen3.7-max via Qwen Code /review
Local verification report (merge reference)Verified head SHA Reviewer Test Plan — all PASS
Wider regression sweep (also PASS)
CI on the PR head: 24/24 checks green (Test on macOS / Ubuntu / Windows × Node 22, Lint, CodeQL, Classify PR). Review summary
VerdictLGTM — verified locally; no blockers. Cleared for merge from my side. — wenshao |
BZ-D
left a comment
There was a problem hiding this comment.
整体逻辑清晰,边界处理到位。发现一个参数错位的 bug,详见行内评论。
| this.lastPromptTokenCount > 0 ? [] : this.getHistoryShallow(true), | ||
| userContent, | ||
| this.lastPromptTokenCount, | ||
| this.lastOutputTokenCount, |
There was a problem hiding this comment.
Bug: estimatePromptTokens 的签名新增了第 4 个参数 lastOutputTokenCount(插在 imageTokenEstimate 之前),但同文件下方 localPromptTokensAfterCompression 的计算(原 L1603-1608)没有同步更新:
const localPromptTokensAfterCompression = shouldForceFromHard
? estimatePromptTokens(
this.lastPromptTokenCount > 0 ? [] : this.getHistoryShallow(true),
userContent,
this.lastPromptTokenCount,
imageTokenEstimate, // ← 现在被当作 lastOutputTokenCount
)
: 0;imageTokenEstimate(通常 1600)被当成 lastOutputTokenCount 传入,而 imageTokenEstimate 则回退到默认值。恰好两个值目前相同所以不会出明显问题,但逻辑上是错的,用户自定义 imageTokenEstimate 时两个参数会同时出错。
建议加上 0:
estimatePromptTokens(
...,
this.lastPromptTokenCount,
0,
imageTokenEstimate,
)There was a problem hiding this comment.
I checked the current PR head (4b27a344e) and this appears to be stale against an older shape of the diff.
The localPromptTokensAfterCompression call site no longer exists (rg localPromptTokensAfterCompression returns no matches). The active geminiChat.ts call now passes both parameters in order:
estimatePromptTokens(
this.lastPromptTokenCount > 0 ? [] : this.getHistoryShallow(true),
userContent,
this.lastPromptTokenCount,
this.lastOutputTokenCount,
imageTokenEstimate,
)The compression-service fallback call site also uses the explicit unavailable-output-token value before the image estimate:
estimatePromptTokens(..., 0, slimmingConfig.imageTokenEstimate)No code change made for this thread unless there is another current call site I missed.
| } | ||
| if (usage.totalTokenCount !== undefined) { | ||
| return Math.max(0, usage.totalTokenCount - usage.promptTokenCount); | ||
| } |
There was a problem hiding this comment.
When totalTokenCount is unavailable, this sums candidatesTokenCount + thoughtsTokenCount assuming they are disjoint. Some OpenAI-compatible providers (e.g. certain DeepSeek-compatible endpoints) include thinking/reasoning tokens inside candidatesTokenCount, making them overlap. In that case this double-counts, inflating the prompt estimate by up to the full thoughtsTokenCount.
The totalTokenCount branch correctly avoids this, but this fallback branch does not. Over-estimating could trigger unnecessary hard-rescue compression one send earlier than needed.
Consider guarding against the overlap case, e.g.:
const candidates = usage.candidatesTokenCount ?? 0;
const thoughts = usage.thoughtsTokenCount ?? 0;
// If thoughts look like they are included in candidates, do not add them.
return Math.max(0, Math.max(candidates, candidates + thoughts));or documenting the known over-count as acceptable conservatism.
There was a problem hiding this comment.
Covered in the current branch. When totalTokenCount is unavailable, getUsageOutputTokenCountForPromptEstimate() now treats thoughtsTokenCount as potentially overlapping only when candidatesTokenCount strictly dominates it, so DeepSeek-compatible shapes that include reasoning inside candidates avoid double-counting. Verified with npm run test --workspace=packages/core -- src/services/tokenEstimation.test.ts, Prettier, and ESLint.
BZ-D
left a comment
There was a problem hiding this comment.
lastOutputTokenCount 的引入和 ?? vs || 的修正都很好。留了一个 inline comment 关于 getUsageOutputTokenCountForPromptEstimate 在 totalTokenCount 缺失时假设 candidates 和 thoughts disjoint —— 部分 provider 会把 thinking tokens 包含在 candidates 里导致 double-count。实际影响是可能提前一轮触发 hard-rescue,不 block,供参考。
|
Follow-up to @BZ-D's param-mismatch finding — it's a merge-order trap, not a same-branch miss @BZ-D is right that the post-compression That call does not exist in this branch yet. Head Merge preview confirms it. const localPromptTokensAfterCompression = shouldForceFromHard
? estimatePromptTokens(
this.lastPromptTokenCount > 0 ? [] : this.getHistoryShallow(true),
userContent,
this.lastPromptTokenCount,
imageTokenEstimate, // ← lands on the new `lastOutputTokenCount` param
)
: 0;This is silent on all three guards:
Impact — low-probability, fail-safe direction, but a real regression: Action items
This is also the concrete instance of @longbinlai's cross-PR conflict warning (the #4531 overlap on Dismissing my earlier R6 approval pending rebase + fix + a green CI run on the rebased branch, since the regression only surfaces post-merge. Good catch by @BZ-D either way. |
Dismissing pending rebase onto main (now includes #4531), the resulting estimatePromptTokens param fix at the localPromptTokensAfterCompression call site, and a green CI run on the rebased branch. This regression only surfaces post-merge — see my follow-up comment #4525 (comment) for the merge-tree evidence. Will re-review after rebase.
47265fc to
01a805c
Compare
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] Stale TODO in packages/cli/src/ui/commands/contextCommand.ts:329 references the old 4-arg estimatePromptTokens signature: estimatePromptTokens(history, undefined, 0, imageTokenEstimate). This PR inserts lastOutputTokenCount as the new 4th parameter, shifting imageTokenEstimate to 5th. A future implementor following this TODO would pass imageTokenEstimate as lastOutputTokenCount, producing a silently wrong estimate.
Suggested fix: update to estimatePromptTokens(history, undefined, 0, 0, imageTokenEstimate).
— qwen3.7-max via Qwen Code /review
| @@ -2632,11 +2669,13 @@ export class GeminiChat { | |||
| usageMetadata = chunk.usageMetadata; | |||
| // Context usage tracks prompt size; output isn't in history yet. | |||
| const lastPromptTokenCount = | |||
There was a problem hiding this comment.
[Suggestion] seedResumeTokenCounts (line 1375) correctly guards with Number.isFinite() + Math.max(0, ...), proving awareness of malformed-input risk. This streaming path assigns promptTokenCount ?? totalTokenCount directly with no finite/non-negative guard. A negative value (e.g. promptTokenCount: -1000) passes the truthy check, sets a negative counter, and causes estimatePromptTokens to under-count — silently disabling the compaction gate. getUsageOutputTokenCountForPromptEstimate also lacks a finite guard on its return value. Low probability with production providers but inconsistent with the resume path.
| const lastPromptTokenCount = | |
| const rawPromptCount = | |
| usageMetadata.promptTokenCount ?? usageMetadata.totalTokenCount; | |
| const lastPromptTokenCount = | |
| Number.isFinite(rawPromptCount) && rawPromptCount > 0 | |
| ? rawPromptCount | |
| : 0; |
And for the output count:
| const lastPromptTokenCount = | |
| const rawOutput = | |
| getUsageOutputTokenCountForPromptEstimate(usageMetadata); | |
| this.lastOutputTokenCount = Number.isFinite(rawOutput) | |
| ? rawOutput | |
| : 0; |
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Re-checked the current branch: this is already covered. The streaming path now uses hasUsablePromptTokenCount / hasUsableTotalTokenCount with typeof === number, Number.isFinite(...), and >= 0, then stores the coerced count. Invalid, negative, or non-finite usage no longer reaches lastPromptTokenCount from this path.
| return Math.max(0, usage.totalTokenCount - usage.promptTokenCount); | ||
| } | ||
| const candidates = Math.max(0, usage.candidatesTokenCount ?? 0); | ||
| const thoughts = Math.max(0, usage.thoughtsTokenCount ?? 0); |
There was a problem hiding this comment.
[Suggestion] At equality (candidatesTokenCount === thoughtsTokenCount, both non-zero), this returns only candidates, assuming overlap. But equality does not imply overlap — a provider could report equal but disjoint counts. Since this estimate is trigger-only (false-positive acceptable, false-negative is not), using > (strictly greater) would be safer: equality falls through to the additive branch.
| const thoughts = Math.max(0, usage.thoughtsTokenCount ?? 0); | |
| return candidates > thoughts ? candidates : candidates + thoughts; |
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Covered in the current branch. The overlap heuristic now uses candidates > thoughts, so equality is treated as ambiguous/disjoint and conservatively adds both counts. Added adds equal candidate and thought counts because equality does not prove overlap. Verified with npm run test --workspace=packages/core -- src/services/tokenEstimation.test.ts, Prettier, and ESLint.
BZ-D
left a comment
There was a problem hiding this comment.
Reviewed the diff and ran the targeted token accounting and resume tests; no blocking issues found.
…rompt-candidates-token-count
01a805c to
f0702b0
Compare
|
Updated in f0702b0 for the latest TODO-signature review. The stale comment in Validation:
|
|
Follow-up: merged current Resolution notes:
Validation run locally on Windows:
Note: when I initially ran several core Vitest commands in parallel on Windows, two runs hit the known |
wenshao
left a comment
There was a problem hiding this comment.
[Critical] [typecheck] packages/cli/src/ui/components/InputPrompt.test.tsx:205 — After the merge from main, BackgroundTaskViewState gained two new fields (livePanelFocused: boolean and livePanelSelectedIndex: number) but the mock at line 205 was not updated. tsc --noEmit fails with TS2345:
Argument of type '{ entries: never[]; selectedIndex: number; dialogMode: "closed"; dialogOpen: false; pillFocused: false; }' is not assignable to parameter of type 'BackgroundTaskViewState'.
Add the missing fields to the mock:
mockedUseBackgroundTaskViewState.mockReturnValue({
entries: [],
selectedIndex: 0,
dialogMode: 'closed',
dialogOpen: false,
pillFocused: false,
livePanelFocused: false,
livePanelSelectedIndex: 0,
});
— qwen3.7-max via Qwen Code /review
…rompt-candidates-token-count # Conflicts: # packages/core/src/core/geminiChat.ts
d476846 to
8a7921b
Compare
|
Updated in Change:
Validation on Windows:
Note: I also retried |
|
Addressed the equality case in Change:
Validation:
Note: one earlier parallel Vitest attempt hit the known Windows |
Verification ReportReviewer: wenshao Build
Tests
Code ReviewtokenEstimation.ts
geminiChat.ts
sessionService.ts
client.ts
chatCompressionService.ts
Verdict✅ LGTM — ready to merge. Fixes the prompt estimate under-counting by including response tokens. Defensive coercion of hostile provider values, atomic resume seeding, and hard-rescue rollback are all correctly implemented. All 673 affected tests pass. |
…rompt-candidates-token-count # Conflicts: # packages/core/src/core/geminiChat.test.ts
DragonnZhang
left a comment
There was a problem hiding this comment.
Qwen Code Review — PR #4525
Verdict: Looks good. No new findings.
Summary
This PR adds lastOutputTokenCount to the prompt-candidate estimation, closing an accuracy gap where previous model-response output tokens were not counted in the next prompt's size estimate. The change is well-scoped, touches the right layers (shared helper, per-chat state, resume restoration, compression reset), and is backed by thorough test coverage.
What was reviewed
- Deterministic checks: tsc and eslint clean (0 findings). 4 non-TS toolchains skipped (expected).
- Semantic review: All 11 changed files read in full. Key design decisions verified:
getUsageOutputTokenCountForPromptEstimatecorrectly handles thetotalTokenCountavailable/absent branches with appropriate clamping.seedResumeTokenCountsvalidates inputs withNumber.isFinite+Math.max(0, ...).setLastPromptTokenCountcorrectly clearslastOutputTokenCount(documented in JSDoc).- Compression success resets
lastOutputTokenCountto 0. - All call sites of
estimatePromptTokensupdated for the new 5-arg signature. - Subagent fork path (
agent-core.ts:368) correctly usessetLastPromptTokenCountwhich clears inherited output token state. chatCompressionService.tspasses 0 with an inline comment explaining the unavailability.
Prior review coverage
This PR has been extensively reviewed by @wenshao and @BZ-D over multiple rounds (R1–R6). All previously identified issues — double-counting via totalTokenCount fallback, hidden side-effect in setter, guard mismatch (|| vs ??), shared formula extraction, stale TODO signature, NaN propagation, resume call-order fragility, OpenAI reasoning-token overlap — have been addressed in the current commit. The remaining 12 stale inline comments are on old commits and already resolved.
— qwen-code via Qwen Code /review
|
I re-checked the current branch. This is already covered in the latest diff: the streaming usage path no longer assigns Current behavior:
So invalid / negative / non-finite prompt usage no longer reaches |
What this PR does
Includes response-side tokens when estimating prompt candidates so request-size checks account for the full conversation payload more accurately.
Why it's needed
Ignoring response tokens can underestimate the size of candidate histories and allow oversized histories to advance too far before later guards reject them.
Reviewer Test Plan
How to verify
Run the targeted prompt-estimate tests in packages/core, then run eslint/prettier/typecheck for the touched core files. The GitHub CI matrix also runs lint and tests on macOS, Windows, and Linux.
Evidence (Before & After)
Before: prompt candidate estimates could be lower than the actual model-facing history. After: regression coverage verifies response tokens are included in the estimate. This is core estimation behavior, so TUI screenshots are N/A.
Tested on
Environment (optional)
Local Windows/PowerShell checkout with repository npm workspaces. No tmux/TUI capture is included for PRs whose behavior is core, session, parser, or transport logic rather than a visible TUI state.
Risk & Scope
Linked Issues
Fixes #4349
中文说明
这个 PR 做了什么
估算 prompt candidates 时纳入 response 侧 tokens,使请求大小判断更接近真实上下文。
为什么需要
忽略 response tokens 会低估历史大小,让过大的历史走到更晚阶段才失败。
Reviewer Test Plan
本地按 PR 相关 core 测试、lint、format check、typecheck 验证;macOS/Linux 由 GitHub CI 验证。
风险和范围
只影响候选历史的 token 估算,不改 provider API 或 UI。