fix(subagent): re-pair tool-results on gateway-loop replay (AI_MissingToolResultsError) #1934
Open
thomaskong119 wants to merge 1 commit into
…gToolResultsError)
The gateway tool-loop persists tool executions to `subagent_tool_executions`,
not as `subagent_messages` rows. Reconstructing the conversation on replay with
a naive 1:1 map (`priorMessages.map(...)`) therefore yields assistant tool-call
turns with no following tool-result message. On replay the loop seeds its
`messages` from this reconstruction, and the next `generateText()` throws AI SDK
v6's `AI_MissingToolResultsError` ("Tool results are missing for tool calls …").
This is fatal for any non-Anthropic model routed through the gateway loop
(`agent.use_gateway_loop`) that emits PARALLEL tool calls: a single crashed turn
drops several tool-results at once, so every subagent retry permanently fails.
Seen in production on GLM (glm-coding) for cron/agent jobs the day after the
gateway loop took over (company-pulse + dream-synthesize subagents all died).
Fix: extract a pure `reconstructReplayMessages()` that re-inserts a tool-result
message after each prior assistant tool-call turn, pairing every call with its
persisted outcome (keyed by message_idx + provider tool_use_id). A call with no
completed execution (crash before the result was persisted) gets a synthesized
error result so the turn still validates and the model recovers on the next turn.
Counters (`nextMessageIdx`, `nextTurnIdx`, `initialMessages`) now derive from the
persisted-message count (`priorMessages`), not the inflated reconstruction.
The pre-existing crash-replay e2e stubs the chat transport, so it short-circuits
before `generateText` and never exercised the ModelMessage conversion — which is
why this slipped through. The new test drives the real AI SDK v6 `generateText`
(no network) and asserts the re-paired history validates while the naive map throws.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Every subagent routed through the gateway tool-loop (`agent.use_gateway_loop`, i.e. any non-Anthropic provider) that emits parallel tool calls permanently fails on retry with "Tool results are missing for tool calls …", i.e. AI SDK v6's `AI_MissingToolResultsError`. In production on GLM (glm-coding) this killed cron/agent jobs the day after the gateway loop took over: company-pulse and every dream-synthesize subagent that fanned out parallel tool calls died, and the retries compounded the failure.
Root cause
The gateway loop persists tool executions to `subagent_tool_executions`, not as `subagent_messages` rows. On replay, `runSubagentViaGateway` rebuilt the prior conversation with a naive 1:1 map (`priorMessages.map(...)`). That yields assistant tool-call turns with no following tool-result message. The loop seeds `messages` from `replayState.priorMessages`, so the next `generateText()` sees unanswered tool calls and v6 rejects the prompt. A single crashed turn with N parallel calls drops all N tool-results at once.
Fix
- Extract a pure `reconstructReplayMessages(priorMessages, priorToolExecs)` that re-inserts a tool-result message after each prior assistant tool-call turn, pairing every call with its persisted outcome (keyed by `message_idx` + provider `tool_use_id`). A call with no completed execution (crash before the result was persisted) gets a synthesized error result so the turn still validates and the model recovers on the next turn.
- Derive `nextMessageIdx`, `nextTurnIdx`, and the `initialMessages` seed-condition from the persisted-message count (`priorMessages`), not `priorChatMessages.length`. The latter now also counts the re-inserted tool-result messages (which are not their own `subagent_messages` rows), so the loop's `messageIdx` counter would otherwise skip.
Provider-neutral; only reachable on the gateway path.
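The pairing step can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: the row shapes (`PersistedMessage`, `ToolExecution`), the simplified `ReplayMessage` union, the `status` field, and the `findUnansweredCalls` checker are all hypothetical stand-ins, and the real function presumably emits AI SDK `ModelMessage` values rather than these types.

```typescript
// Hypothetical shapes; the real subagent_messages / subagent_tool_executions
// schemas are not shown in this PR.
type ToolCall = { toolUseId: string; toolName: string };
type PersistedMessage = {
  messageIdx: number;
  role: "user" | "assistant";
  text: string;
  toolCalls?: ToolCall[];
};
type ToolExecution = {
  messageIdx: number;
  toolUseId: string;
  status: "completed" | "pending";
  output?: unknown;
};
type ToolResult = { toolUseId: string; toolName: string; isError: boolean; output: unknown };
type ReplayMessage =
  | { role: "user" | "assistant"; text: string; toolCalls?: ToolCall[] }
  | { role: "tool"; results: ToolResult[] };

// Re-insert one tool-result message after every assistant tool-call turn.
function reconstructReplayMessages(
  prior: PersistedMessage[],
  execs: ToolExecution[],
): ReplayMessage[] {
  // Index completed executions by (message_idx, tool_use_id).
  const completed = new Map<string, ToolExecution>();
  for (const e of execs) {
    if (e.status === "completed") completed.set(`${e.messageIdx}:${e.toolUseId}`, e);
  }
  const out: ReplayMessage[] = [];
  for (const m of prior) {
    out.push({ role: m.role, text: m.text, toolCalls: m.toolCalls });
    if (m.role !== "assistant" || !m.toolCalls?.length) continue;
    const results = m.toolCalls.map((call): ToolResult => {
      const exec = completed.get(`${m.messageIdx}:${call.toolUseId}`);
      return exec
        ? { ...call, isError: false, output: exec.output }
        : // No persisted outcome: synthesize an error so the turn still pairs up.
          { ...call, isError: true, output: "Tool result was not persisted before the crash." };
    });
    out.push({ role: "tool", results });
  }
  return out;
}

// Stand-in for the SDK's validation: list tool calls with no result in the next message.
function findUnansweredCalls(messages: ReplayMessage[]): string[] {
  const missing: string[] = [];
  messages.forEach((m, i) => {
    if (m.role === "tool" || !m.toolCalls?.length) return;
    const next = messages[i + 1];
    const answered = new Set<string>(
      next?.role === "tool" ? next.results.map((r) => r.toolUseId) : [],
    );
    for (const call of m.toolCalls) {
      if (!answered.has(call.toolUseId)) missing.push(call.toolUseId);
    }
  });
  return missing;
}

// A crashed turn with two parallel calls, only one of which was persisted.
const prior: PersistedMessage[] = [
  { messageIdx: 0, role: "user", text: "Summarize today's activity." },
  {
    messageIdx: 1,
    role: "assistant",
    text: "",
    toolCalls: [
      { toolUseId: "call_a", toolName: "search" },
      { toolUseId: "call_b", toolName: "fetch" },
    ],
  },
];
const execs: ToolExecution[] = [
  { messageIdx: 1, toolUseId: "call_a", status: "completed", output: { hits: 3 } },
];

const naive: ReplayMessage[] = prior.map((m) => ({
  role: m.role,
  text: m.text,
  toolCalls: m.toolCalls,
}));
const repaired = reconstructReplayMessages(prior, execs);

// Counters follow the persisted rows, not the longer reconstruction.
const nextMessageIdx = prior.length;

console.log(findUnansweredCalls(naive)); // both calls are unanswered
console.log(findUnansweredCalls(repaired)); // empty: every call is paired
```

In this sketch the reconstruction has three entries while only two rows were persisted, which is why the counters are taken from `prior.length` rather than `repaired.length`.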
Why the existing tests didn't catch it
`test/e2e/subagent-crash-replay-multi-provider.test.ts` stubs the chat transport (`__setChatTransportForTests`), so it short-circuits before `generateText` runs and never exercised the ModelMessage conversion, the same blind spot called out in `test/ai/gateway-tools-schema.test.ts`.
Testing
New `test/subagent-replay-toolresult-repair.test.ts` (4 tests, no network): it drives the real AI SDK v6 `generateText` (via `MockLanguageModelV3`) and asserts the re-paired history validates while the naive 1:1 map throws.
Regression: `subagent-crash-replay-multi-provider` (13), `gateway-tools-schema` (3), `gateway-tool-loop` (7) all still pass.
🤖 Generated with Claude Code