Scope: follow-up to #4345 (auto-compaction three-tier ladder).
estimatePromptTokens's steady-state branch:
return lastPromptTokenCount + estimateContentTokens([userMessage])
That covers the input sent on the previous turn plus the new user message, but it misses the previous turn's model response, which is appended to history between the API response handler and the next turn's prompt-size estimate. The under-count is typically 500–5000 tokens.
Why it matters now (didn't matter before)
Pre-#4345 the 70% threshold was far enough from the window edge that this under-count was inconsequential. The new hard tier sits only HARD_BUFFER (≈3K) from the window edge — well within one model response. When the real prompt has crossed hard but the estimate hasn't, hard-rescue doesn't fire and the API call overflows. Reactive recovery catches it (no data loss) but the user pays a doomed API round-trip first (~2-5s latency).
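To make the gap concrete, here is a small worked example. Every number in it is hypothetical (the window size, the token counts, and the `>=` comparison against the hard threshold are assumptions for illustration); only the ≈3K HARD_BUFFER figure comes from the description above.

```typescript
// All values are hypothetical, chosen only to illustrate the under-count.
const CONTEXT_WINDOW = 128_000; // assumed window size
const HARD_BUFFER = 3_000; // "≈3K", per the description above
const hardThreshold = CONTEXT_WINDOW - HARD_BUFFER; // 125_000

const lastPromptTokenCount = 123_000; // input sent on the previous turn
const previousResponseTokens = 2_500; // model response, now part of history
const userMessageTokens = 300; // new user message

// Current steady-state estimate: omits the previous response.
const estimated = lastPromptTokenCount + userMessageTokens;

// What the next request actually carries.
const actual =
  lastPromptTokenCount + previousResponseTokens + userMessageTokens;

// Assumed comparison; the real tier check may be written differently.
const rescueFires = estimated >= hardThreshold;
const reallyPastHard = actual >= hardThreshold;

console.log({ estimated, actual, hardThreshold, rescueFires, reallyPastHard });
```

With these numbers the estimate stays 1,700 tokens under the hard threshold while the real prompt is 800 tokens over it, so the pre-call rescue is skipped.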
Proposed fix
Plumb lastCandidatesTokenCount (the API's candidatesTokenCount from the previous turn's usage metadata) alongside lastPromptTokenCount:
- New private field on GeminiChat: lastCandidatesTokenCount
- Capture from usageMetadata.candidatesTokenCount in the streaming response handler, alongside promptTokenCount
- Reset to 0 in:
  - external setLastPromptTokenCount seeder (inherited history has no anchor)
  - post-COMPRESSED branch (history rewritten → response absorbed into snapshot envelope, already counted in info.newTokenCount)
- estimatePromptTokens takes optional lastCandidatesTokenCount: number = 0; steady-state branch adds it. Cold-start branch (lastPromptTokenCount === 0) unchanged.
- Single production caller (sendMessageStream hard-rescue pre-call) passes this.lastCandidatesTokenCount.
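The plumbing described above could look roughly like the following sketch. It is not the actual GeminiChat code: ChatSketch, onUsageMetadata, onCompressed, estimateNext, the string-based history, the 4-characters-per-token heuristic, and the cold-start behaviour are all hypothetical stand-ins. Only lastPromptTokenCount, lastCandidatesTokenCount, setLastPromptTokenCount, estimatePromptTokens, estimateContentTokens, and the usage-metadata field names are taken from the description.

```typescript
// Hypothetical shape of the usage metadata the stream handler sees.
interface UsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
}

// Assumed heuristic (~4 chars per token); the real estimator may differ.
function estimateContentTokens(messages: string[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.length / 4), 0);
}

function estimatePromptTokens(
  history: string[],
  userMessage: string,
  lastPromptTokenCount: number,
  lastCandidatesTokenCount: number = 0,
): number {
  if (lastPromptTokenCount === 0) {
    // Cold start: no API anchor yet. Assumed here to fall back to a
    // content-based estimate of the full history plus the new message.
    return estimateContentTokens([...history, userMessage]);
  }
  // Steady state: previous input + previous response + new user message.
  return (
    lastPromptTokenCount +
    lastCandidatesTokenCount +
    estimateContentTokens([userMessage])
  );
}

class ChatSketch {
  private lastPromptTokenCount = 0;
  private lastCandidatesTokenCount = 0;
  private history: string[] = [];

  // Streaming response handler: capture both counts together.
  onUsageMetadata(usage: UsageMetadata): void {
    if (usage.promptTokenCount !== undefined) {
      this.lastPromptTokenCount = usage.promptTokenCount;
    }
    if (usage.candidatesTokenCount !== undefined) {
      this.lastCandidatesTokenCount = usage.candidatesTokenCount;
    }
  }

  // External seeder: inherited history has no response anchor, so reset.
  setLastPromptTokenCount(count: number): void {
    this.lastPromptTokenCount = count;
    this.lastCandidatesTokenCount = 0;
  }

  // Post-compression (assumed hook): the response is already counted
  // in newTokenCount, so the candidates count must not be added again.
  onCompressed(newTokenCount: number): void {
    this.lastPromptTokenCount = newTokenCount;
    this.lastCandidatesTokenCount = 0;
  }

  // Stand-in for the hard-rescue pre-call estimate.
  estimateNext(userMessage: string): number {
    return estimatePromptTokens(
      this.history,
      userMessage,
      this.lastPromptTokenCount,
      this.lastCandidatesTokenCount,
    );
  }
}
```

Defaulting the new parameter to 0 keeps any existing call sites and tests on the old behaviour, and the two resets prevent a stale response count from being added on top of a token count that already includes it.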
Related
R10.1 of PR #4168 (archived at tag pr-4168-archive-pre-revert). It can be cherry-picked cleanly from that commit.