fix(core): recover from truncated tool calls via multi-turn continuation#3313
Conversation
…ion (#3049) When large tool calls (e.g., WriteFile with big HTML) exceed the output token limit, the model's response gets truncated and required parameters like file_path are missing. Previously this surfaced as a confusing "params must have required property" error. Three-layer defense: 1. Escalate to model's actual output limit (not fixed 64K). Models with 128K output (Claude Opus, GPT-5) now use their full capacity. 2. Multi-turn recovery: if the escalated response is still truncated, keep the partial response in history and inject a recovery message ("Resume directly — pick up mid-thought") so the model continues from where it left off. Up to 3 recovery attempts before falling back to the tool scheduler's guidance. 3. Stronger truncation guidance as fallback: "you MUST split" instead of "consider splitting". Also fixes: - Clear toolCallRequests on RETRY to prevent duplicate tool execution - Add isContinuation flag to RETRY events so the UI preserves text buffers during recovery (continuation) but resets them during escalation (fresh restart) - Catch errors during recovery to prevent dangling history entries
📋 Review SummaryThis PR implements a robust three-layer defense against output token truncation when model responses exceed token limits, addressing issue #3049. The implementation is well-structured, follows existing patterns, and includes thoughtful error handling. The changes are focused and demonstrate good understanding of the streaming architecture. 🔍 General Feedback
🎯 Specific Feedback🟡 High
// Before the recovery attempt, add:
if (recoveryCount > 0) {
const delayMs = Math.min(1000 * recoveryCount, 3000); // Cap at 3s
await new Promise(resolve => setTimeout(resolve, delayMs));
}
// Add guard before pushing recovery message:
if (
self.history.length === 0 ||
self.history[self.history.length - 1].role !== 'user'
) {
self.history.push(createUserContent([{ text: OUTPUT_RECOVERY_MESSAGE }]));
}🟢 Medium
let escalatedFinishReason: string | undefined = undefined;
// Current code is fine since toolCallRequests is declared in scope,
// but adding a defensive check wouldn't hurt:
if (toolCallRequests && toolCallRequests.length > 0) {
toolCallRequests.length = 0;
}🔵 Low
✅ Highlights
|
…hanism Update the design doc to reflect: - Escalation now targets model's actual output limit (64K floor) - Multi-turn recovery loop after escalation (up to 3 attempts) - isContinuation flag on RETRY events - Recovery error handling (pop dangling message, break) - Updated constants table and model-specific escalation limits - New design decision: why multi-turn recovery over progressive escalation
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] The ACP cron path starts a background rewrite with this.messageRewriter.flushTurn(ac.signal), but unlike the normal prompt path it never waits for pending rewrites before returning. That can drop the final rewritten ACP progress update if the cron turn finishes immediately after scheduling the rewrite. Consider mirroring the normal prompt flow by awaiting this.messageRewriter.waitForPendingRewrites() before #executeCronPrompt() returns.
— gpt-5.4 via Qwen Code /review
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
Adds a well-structured 3-layer recovery for truncated model output. The recovery loop is cleanly bounded by MAX_OUTPUT_RECOVERY_ATTEMPTS, and the two bug fixes (duplicate tool execution on RETRY, text duplication on escalation) are real and correct. However, the continuation path has two correctness bugs.
Issues
-
Partial text lost from display during continuation. On a continuation RETRY,
setPendingHistoryItem(null)clears the partial text. WhenhandleContentEventruns for the first continuation chunk, it sees a null pending item, creates a new one, and resetsgeminiMessageBuffer = eventValue— discarding the preserved buffer. The user loses the beginning of the recovered response. Suggestion: commit the partial text to the static item list viaaddItembefore clearing, or don't clearpendingHistoryItemRef.currentand let the continuation append to it. -
Recovery on truncated tool-call turns corrupts conversation state. If the truncated turn already emitted a complete
functionCall, the recovery injects a plain user message — creatingmodel(functionCall) → user(text)with nofunctionResponse. This is an invalid sequence for the API. Suggestion: skip multi-turn recovery when the truncated turn containsfunctionCallparts, and let the existing tool scheduler fallback (layer 3) handle those cases. -
Recovery errors swallowed after partial chunks emitted. A recovery attempt can yield chunks then fail. The catch block breaks without emitting an error event, leaving the UI with partial text and no failure indication. Suggestion: emit an error or warning event in the catch block.
Verdict
REQUEST_CHANGES — The text loss and conversation state corruption need to be fixed. Test coverage for the new recovery loop is also expected.
|
Thanks for the thorough review! Re High #1 (backoff between recovery attempts): Re High #2 (guard before pushing recovery message): This is already handled by the catch block. If a recovery attempt fails, the catch pops the dangling user message and breaks out of the loop. In the success path, Re Medium #3-5: Acknowledged as style preferences. Re Low #6-9: Good suggestions for future iterations. Keeping them out of this PR scope to minimize change surface. |
|
Hi @wenshao, looks like your earlier response was addressing the github-actions bot review. Could you take a look at the three critical issues I flagged in my review above?
Happy to discuss if you disagree on any of these — (2) in particular would be good to confirm against real API behavior. |
Three correctness fixes from @tanzhenxin's review: 1. Partial text lost during continuation (useGeminiStream.ts): On continuation RETRY, setPendingHistoryItem(null) cleared the pending gemini item. The next Content event then saw a null pending item, created a fresh one, and reset geminiMessageBuffer = eventValue — discarding the preserved partial text. Now the pending item AND buffers are kept on continuation, so the continuation appends. 2. Recovery on truncated tool-call turns (geminiChat.ts): When the truncated turn already contains a complete functionCall, appending a user recovery message produces model(functionCall) → user(text) with no intervening functionResponse — an invalid API sequence. Now recovery skips turns with functionCall parts and defers to the tool scheduler's layer-3 fallback. 3. Recovery errors swallowed after partial chunks (geminiChat.ts): If a recovery attempt yielded chunks then failed, the catch block broke without emitting any terminal signal, leaving the UI with partial text and no Finished event. Now emits a synthetic finishReason=STOP chunk in the catch so the UI gets a proper terminal signal.
|
@tanzhenxin You are absolutely right on all three — apologies for conflating your review with the bot review earlier. Addressed in f683830: 1. Partial text lost during continuation — Kept both 2. Recovery on truncated tool-call turns — Added a guard at the top of the recovery loop: if the last model entry in history contains any 3. Recovery errors swallowed after partial chunks — In the catch block, now emit a synthetic chunk with Regarding (2), I verified by tracing the history structure after the guard fires: the truncated model turn with functionCall stays intact, Will add targeted tests for the recovery loop next. |
Four targeted tests for the recovery mechanism introduced in the truncated-tool-call-recovery PR: 1. Recovery loop fires when escalated response is also truncated: initial MAX_TOKENS → escalation MAX_TOKENS → recovery STOP. Verifies two RETRY events (one escalation, one continuation) and three API calls. 2. Recovery is skipped when truncated turn contains a functionCall: prevents the invalid model(functionCall) → user(text) sequence. Verifies no continuation RETRY and history ends with the functionCall intact. 3. Recovery attempts are capped at MAX_OUTPUT_RECOVERY_ATTEMPTS (3): persistent MAX_TOKENS triggers exactly 5 API calls (1 initial + 1 escalation + 3 recovery). 4. Recovery catch block emits synthetic STOP chunk and pops dangling user message: when a recovery attempt fails (empty stream → InvalidStreamError), the UI gets a terminal signal and history ends on the model turn, not a dangling user recovery message.
|
Added 4 targeted recovery-loop tests in b9340c1 covering:
Total: 198 tests pass (194 existing + 4 new). |
Existing tests cover the functionCall guard when both initial and escalated responses have functionCall. This adds a test for the cross-iteration case: iter 1 returns text (recovery proceeds), iter 2 returns functionCall (recovery must break before iter 3). Verifies: - API called exactly 4 times (1 initial + 1 escalation + 2 recovery) - History ends with the functionCall model turn, not a dangling user recovery message - Iter 3's user recovery message is never pushed (guard fires at top of loop before recoveryCount increment)
The object literal {candidates, content, parts} doesn't structurally
overlap enough with GenerateContentResponse for TypeScript's strict
narrow cast. Casting through 'unknown' is required per TS2352.
Build error from CI:
src/core/geminiChat.ts(651,24): error TS2352: Conversion of type '...'
to type 'GenerateContentResponse' may be a mistake because neither
type sufficiently overlaps with the other. If this was intentional,
convert the expression to 'unknown' first.
|
Noting the earlier Only changed files in this PR:
|
Strengthen the "pop dangling recovery message" test to catch any future regression that leaves consecutive same-role entries or an empty last-model placeholder in history — conditions providers reject on the next turn.
Previously every output-token recovery iteration left a (user, model) pair in durable history where the user turn was the internal OUTPUT_RECOVERY_MESSAGE control prompt. That prompt was then visible to every later turn, biasing responses and polluting compression, replay, and export. Track successful recovery iterations and, after the recovery loop, fold each completed pair back into the preceding model turn via a new `coalesceRecoveryPairs` helper. Failed iterations already pop their user turn in the catch block, so they need no coalescing. Adds a targeted test that runs escalation + two successful recovery iterations + a clean STOP, and asserts the merged history has exactly one user turn and one model turn, no trace of the control prompt text, and content ordered as B (escalation) + C + D.
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review
tanzhenxin
left a comment
There was a problem hiding this comment.
All three blockers from the prior review are addressed with targeted tests (partial-text preservation, functionCall guard with cross-iter coverage, synthetic STOP on catch). The post-loop coalesceRecoveryPairs cleanup for the control prompt is a nice catch — keeps durable history free of synthetic turns. LGTM.
…ion (#3313) * fix(core): recover from truncated tool calls via multi-turn continuation (#3049) When large tool calls (e.g., WriteFile with big HTML) exceed the output token limit, the model's response gets truncated and required parameters like file_path are missing. Previously this surfaced as a confusing "params must have required property" error. Three-layer defense: 1. Escalate to model's actual output limit (not fixed 64K). Models with 128K output (Claude Opus, GPT-5) now use their full capacity. 2. Multi-turn recovery: if the escalated response is still truncated, keep the partial response in history and inject a recovery message ("Resume directly — pick up mid-thought") so the model continues from where it left off. Up to 3 recovery attempts before falling back to the tool scheduler's guidance. 3. Stronger truncation guidance as fallback: "you MUST split" instead of "consider splitting". Also fixes: - Clear toolCallRequests on RETRY to prevent duplicate tool execution - Add isContinuation flag to RETRY events so the UI preserves text buffers during recovery (continuation) but resets them during escalation (fresh restart) - Catch errors during recovery to prevent dangling history entries * docs: update adaptive output token escalation design for recovery mechanism Update the design doc to reflect: - Escalation now targets model's actual output limit (64K floor) - Multi-turn recovery loop after escalation (up to 3 attempts) - isContinuation flag on RETRY events - Recovery error handling (pop dangling message, break) - Updated constants table and model-specific escalation limits - New design decision: why multi-turn recovery over progressive escalation * fix: remove competitor reference from code comment * fix: address review feedback on recovery mechanism Three correctness fixes from @tanzhenxin's review: 1. Partial text lost during continuation (useGeminiStream.ts): On continuation RETRY, setPendingHistoryItem(null) cleared the pending gemini item. The next Content event then saw a null pending item, created a fresh one, and reset geminiMessageBuffer = eventValue — discarding the preserved partial text. Now the pending item AND buffers are kept on continuation, so the continuation appends. 2. Recovery on truncated tool-call turns (geminiChat.ts): When the truncated turn already contains a complete functionCall, appending a user recovery message produces model(functionCall) → user(text) with no intervening functionResponse — an invalid API sequence. Now recovery skips turns with functionCall parts and defers to the tool scheduler's layer-3 fallback. 3. Recovery errors swallowed after partial chunks (geminiChat.ts): If a recovery attempt yielded chunks then failed, the catch block broke without emitting any terminal signal, leaving the UI with partial text and no Finished event. Now emits a synthetic finishReason=STOP chunk in the catch so the UI gets a proper terminal signal. * test: add coverage for output token recovery loop Four targeted tests for the recovery mechanism introduced in the truncated-tool-call-recovery PR: 1. Recovery loop fires when escalated response is also truncated: initial MAX_TOKENS → escalation MAX_TOKENS → recovery STOP. Verifies two RETRY events (one escalation, one continuation) and three API calls. 2. Recovery is skipped when truncated turn contains a functionCall: prevents the invalid model(functionCall) → user(text) sequence. Verifies no continuation RETRY and history ends with the functionCall intact. 3. Recovery attempts are capped at MAX_OUTPUT_RECOVERY_ATTEMPTS (3): persistent MAX_TOKENS triggers exactly 5 API calls (1 initial + 1 escalation + 3 recovery). 4. Recovery catch block emits synthetic STOP chunk and pops dangling user message: when a recovery attempt fails (empty stream → InvalidStreamError), the UI gets a terminal signal and history ends on the model turn, not a dangling user recovery message. * test: cover cross-iteration functionCall detection in recovery loop Existing tests cover the functionCall guard when both initial and escalated responses have functionCall. This adds a test for the cross-iteration case: iter 1 returns text (recovery proceeds), iter 2 returns functionCall (recovery must break before iter 3). Verifies: - API called exactly 4 times (1 initial + 1 escalation + 2 recovery) - History ends with the functionCall model turn, not a dangling user recovery message - Iter 3's user recovery message is never pushed (guard fires at top of loop before recoveryCount increment) * fix(core): cast synthetic STOP chunk via unknown for TS2352 The object literal {candidates, content, parts} doesn't structurally overlap enough with GenerateContentResponse for TypeScript's strict narrow cast. Casting through 'unknown' is required per TS2352. Build error from CI: src/core/geminiChat.ts(651,24): error TS2352: Conversion of type '...' to type 'GenerateContentResponse' may be a mistake because neither type sufficiently overlaps with the other. If this was intentional, convert the expression to 'unknown' first. * test(core): tighten recovery history integrity assertions Strengthen the "pop dangling recovery message" test to catch any future regression that leaves consecutive same-role entries or an empty last-model placeholder in history — conditions providers reject on the next turn. * fix(core): coalesce recovery pairs to avoid leaking control prompt Previously every output-token recovery iteration left a (user, model) pair in durable history where the user turn was the internal OUTPUT_RECOVERY_MESSAGE control prompt. That prompt was then visible to every later turn, biasing responses and polluting compression, replay, and export. Track successful recovery iterations and, after the recovery loop, fold each completed pair back into the preceding model turn via a new `coalesceRecoveryPairs` helper. Failed iterations already pop their user turn in the catch block, so they need no coalescing. Adds a targeted test that runs escalation + two successful recovery iterations + a clean STOP, and asserts the merged history has exactly one user turn and one model turn, no trace of the control prompt text, and content ordered as B (escalation) + C + D.
…ion (QwenLM#3313) * fix(core): recover from truncated tool calls via multi-turn continuation (QwenLM#3049) When large tool calls (e.g., WriteFile with big HTML) exceed the output token limit, the model's response gets truncated and required parameters like file_path are missing. Previously this surfaced as a confusing "params must have required property" error. Three-layer defense: 1. Escalate to model's actual output limit (not fixed 64K). Models with 128K output (Claude Opus, GPT-5) now use their full capacity. 2. Multi-turn recovery: if the escalated response is still truncated, keep the partial response in history and inject a recovery message ("Resume directly — pick up mid-thought") so the model continues from where it left off. Up to 3 recovery attempts before falling back to the tool scheduler's guidance. 3. Stronger truncation guidance as fallback: "you MUST split" instead of "consider splitting". Also fixes: - Clear toolCallRequests on RETRY to prevent duplicate tool execution - Add isContinuation flag to RETRY events so the UI preserves text buffers during recovery (continuation) but resets them during escalation (fresh restart) - Catch errors during recovery to prevent dangling history entries * docs: update adaptive output token escalation design for recovery mechanism Update the design doc to reflect: - Escalation now targets model's actual output limit (64K floor) - Multi-turn recovery loop after escalation (up to 3 attempts) - isContinuation flag on RETRY events - Recovery error handling (pop dangling message, break) - Updated constants table and model-specific escalation limits - New design decision: why multi-turn recovery over progressive escalation * fix: remove competitor reference from code comment * fix: address review feedback on recovery mechanism Three correctness fixes from @tanzhenxin's review: 1. Partial text lost during continuation (useGeminiStream.ts): On continuation RETRY, setPendingHistoryItem(null) cleared the pending gemini item. The next Content event then saw a null pending item, created a fresh one, and reset geminiMessageBuffer = eventValue — discarding the preserved partial text. Now the pending item AND buffers are kept on continuation, so the continuation appends. 2. Recovery on truncated tool-call turns (geminiChat.ts): When the truncated turn already contains a complete functionCall, appending a user recovery message produces model(functionCall) → user(text) with no intervening functionResponse — an invalid API sequence. Now recovery skips turns with functionCall parts and defers to the tool scheduler's layer-3 fallback. 3. Recovery errors swallowed after partial chunks (geminiChat.ts): If a recovery attempt yielded chunks then failed, the catch block broke without emitting any terminal signal, leaving the UI with partial text and no Finished event. Now emits a synthetic finishReason=STOP chunk in the catch so the UI gets a proper terminal signal. * test: add coverage for output token recovery loop Four targeted tests for the recovery mechanism introduced in the truncated-tool-call-recovery PR: 1. Recovery loop fires when escalated response is also truncated: initial MAX_TOKENS → escalation MAX_TOKENS → recovery STOP. Verifies two RETRY events (one escalation, one continuation) and three API calls. 2. Recovery is skipped when truncated turn contains a functionCall: prevents the invalid model(functionCall) → user(text) sequence. Verifies no continuation RETRY and history ends with the functionCall intact. 3. Recovery attempts are capped at MAX_OUTPUT_RECOVERY_ATTEMPTS (3): persistent MAX_TOKENS triggers exactly 5 API calls (1 initial + 1 escalation + 3 recovery). 4. Recovery catch block emits synthetic STOP chunk and pops dangling user message: when a recovery attempt fails (empty stream → InvalidStreamError), the UI gets a terminal signal and history ends on the model turn, not a dangling user recovery message. * test: cover cross-iteration functionCall detection in recovery loop Existing tests cover the functionCall guard when both initial and escalated responses have functionCall. This adds a test for the cross-iteration case: iter 1 returns text (recovery proceeds), iter 2 returns functionCall (recovery must break before iter 3). Verifies: - API called exactly 4 times (1 initial + 1 escalation + 2 recovery) - History ends with the functionCall model turn, not a dangling user recovery message - Iter 3's user recovery message is never pushed (guard fires at top of loop before recoveryCount increment) * fix(core): cast synthetic STOP chunk via unknown for TS2352 The object literal {candidates, content, parts} doesn't structurally overlap enough with GenerateContentResponse for TypeScript's strict narrow cast. Casting through 'unknown' is required per TS2352. Build error from CI: src/core/geminiChat.ts(651,24): error TS2352: Conversion of type '...' to type 'GenerateContentResponse' may be a mistake because neither type sufficiently overlaps with the other. If this was intentional, convert the expression to 'unknown' first. * test(core): tighten recovery history integrity assertions Strengthen the "pop dangling recovery message" test to catch any future regression that leaves consecutive same-role entries or an empty last-model placeholder in history — conditions providers reject on the next turn. * fix(core): coalesce recovery pairs to avoid leaking control prompt Previously every output-token recovery iteration left a (user, model) pair in durable history where the user turn was the internal OUTPUT_RECOVERY_MESSAGE control prompt. That prompt was then visible to every later turn, biasing responses and polluting compression, replay, and export. Track successful recovery iterations and, after the recovery loop, fold each completed pair back into the preceding model turn via a new `coalesceRecoveryPairs` helper. Failed iterations already pop their user turn in the catch block, so they need no coalescing. Adds a targeted test that runs escalation + two successful recovery iterations + a clean STOP, and asserts the merged history has exactly one user turn and one model turn, no trace of the control prompt text, and content ordered as B (escalation) + C + D.
Summary
Fixes #3049 — WriteFile/ReadFile tool calls fail with
params must have required property 'file_path'when model output is truncated due to max_tokens limit.Root cause
When the model generates a large tool call (e.g., WriteFile with a big HTML file), the response can exceed the output token limit. The adaptive escalation (#2898) retries at 64K, but:
file_pathmissing) and the user sees a confusing validation errorRecovery flow / 恢复流程
Changes / 改动详情
1. Escalate to the model's actual output limit / 提升到模型实际输出上限
Instead of a fixed 64K escalation target, use
Math.max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output')).不再固定提升到 64K,而是取 64K 与模型实际输出上限的较大值:
2. Multi-turn recovery after escalation / 提升后多轮续写恢复
When the escalated response is still truncated (finishReason === MAX_TOKENS):
functionCall, recovery is skipped to avoid producing an invalidmodel(functionCall) → user(text)API sequence. The tool scheduler's layer-3 fallback handles it.当提升后响应仍然被截断时:若截断轮含
functionCall则跳过恢复(交给 layer 3,避免破坏 API 序列合约);否则保留部分响应并注入恢复消息,让模型从断点续写,最多 3 次。The recovery RETRY uses
isContinuation: trueso the UI preserves the text buffer (appending continuation text) rather than resetting it (as done for escalation).恢复 RETRY 使用
isContinuation: true,UI 保留文本缓冲区(追加续写内容),而非像提升重试那样重置。If a recovery attempt fails (e.g., empty response, network error), the catch block pops the dangling recovery message, emits a synthetic
STOPchunk for a clean UI terminal signal, and exits the loop gracefully.如果恢复尝试失败(空响应、网络错误等),catch 会弹出悬空的恢复消息,发出合成的
STOPchunk 作为 UI 终止信号,然后优雅退出循环。3. Stronger truncation guidance (fallback) / 加强截断引导(兜底)
If recovery is exhausted or skipped, the tool scheduler's truncation rejection uses more directive language:
恢复耗尽或跳过后,工具调度器的截断拒绝消息改用更明确的指令:
Bug fixes / 附带修复
toolCallRequestsinuseGeminiStream.tswas never cleared on Retry events, causing tool calls from a truncated attempt to persist alongside the retried attempt's calls. Now cleared on every RETRY. /toolCallRequests在 Retry 事件时从未清空,导致截断尝试的工具调用与重试的调用并存。现已在每次 RETRY 时清空。geminiMessageBuffer/thoughtBufferwere not reset on escalation RETRY, causing the 8K partial text to be concatenated with the full escalated response. Now reset whenisContinuationis false. / 提升 RETRY 时 buffers 未重置,导致 8K 部分文本与完整提升响应拼接。现在当isContinuation为 false 时重置。pendingHistoryItemcausedhandleContentEventto see a null pending item, create a fresh one, and reset the buffer to just the new chunk. Now the pending item is kept on continuation so the continuation text appends correctly. / 续写 RETRY 时清空pendingHistoryItem导致handleContentEvent看到 null 后创建新 item 并把 buffer 重置为仅新 chunk。现在续写时保留 pending item,续写文本正确追加。Test plan / 测试计划
E2E trace observed / E2E 追踪结果
UI sequence:
write_filefunctionCallfunctionCall→ skips recoverywrite_filewith a 21-line skeleton containingPLACEHOLDER - CSS will be injected/PLACEHOLDER - JS will be injectedmarkers — exactly the behavior the new guidance is designed to produceThis confirms all three defense layers work end-to-end with a real model, and that tanzhenxin's review fix #2 (functionCall guard) correctly prevents the invalid API sequence in practice.
此测试证实三层防御在真实模型调用下端到端工作,且 tanzhenxin 指出的 Issue 2 修复(functionCall guard)在实际场景下正确防止了无效的 API 序列。