🐛 fix(gateway): clean up paused server op after human approve/reject #13860
In Gateway mode with userInterventionConfig.approvalMode='ask', the paused execServerAgentRuntime op was never released: the loading spinner kept spinning after the user approved, rejected, or reject-and-continued, and reject-only silently did nothing on the server.

- ToolAction.rejectToolCall now delegates to chatStore.rejectToolCalling so the Gateway resume op actually fires with decision='rejected'; previously it only mutated local intervention state and the server's paused op waited forever.
- AgentRuntimeCoordinator treats waiting_for_human as end-of-stream, so the coordinator emits agent_runtime_end when request_human_approve flips state, letting the client close the paused op via the normal terminal-event path.
- conversationControl adds #completeRunningServerOps as a fallback guard in the approve/reject/reject-continue Gateway branches: if the server-side signal is delayed or missing, the client still clears the orphan op before starting the resume op.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
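As a rough illustration of the coordinator change, the sketch below treats `waiting_for_human` as a stream-ending state alongside ordinary terminal states. Only `waiting_for_human`, `agent_runtime_end`, and the predicate name `hasEnteredStreamEndState` come from this PR; the other status names, the function signature, and the `eventsFor` helper are assumptions made for the example, not the actual `AgentRuntimeCoordinator` code.

```typescript
// Hypothetical status set; only `waiting_for_human` is taken from the PR text.
type RuntimeStatus = 'running' | 'waiting_for_human' | 'done' | 'error' | 'interrupted';

const STREAM_END_STATES: ReadonlySet<RuntimeStatus> = new Set<RuntimeStatus>([
  'done',
  'error',
  'interrupted',
  'waiting_for_human', // paused for approval: the current stream is over
]);

// True only on the transition INTO a stream-ending state, so the end event
// is emitted once per operation rather than on every later state change.
function hasEnteredStreamEndState(prev: RuntimeStatus, next: RuntimeStatus): boolean {
  return !STREAM_END_STATES.has(prev) && STREAM_END_STATES.has(next);
}

// Hypothetical helper: the events a coordinator might emit for a status sequence.
function eventsFor(statuses: RuntimeStatus[]): string[] {
  const events: string[] = [];
  let prev: RuntimeStatus | undefined;
  for (const next of statuses) {
    if (prev !== undefined && hasEnteredStreamEndState(prev, next)) {
      events.push('agent_runtime_end');
    }
    prev = next;
  }
  return events;
}
```

Under these assumptions, a sequence that goes from `running` to `waiting_for_human` yields a single end event, which is what lets the client close the paused op through its normal terminal-event handling.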
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## canary #13860 +/- ##
==========================================
+ Coverage 66.75% 66.76% +0.01%
==========================================
Files 2044 2044
Lines 174331 174341 +10
Branches 20482 20488 +6
==========================================
+ Hits 116373 116400 +27
+ Misses 57834 57817 -17
Partials 124 124
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1b9a21fb21
// `chatStore.rejectToolCalling` does its own tool-message existence guard, so the
// lookup that used to live here is redundant.
const chatStore = useChatStore.getState();
await chatStore.rejectToolCalling(toolMessageId, reason, context);
Avoid double-dispatching reject before reject-continue
rejectAndContinueToolCall still invokes rejectToolCall first, but this change makes rejectToolCall call chatStore.rejectToolCalling directly. That means a single “reject and continue” click now issues a full reject flow before rejectAndContinueToolCalling runs, which in Gateway mode can start two resume ops (rejected then rejected_continue) and cause duplicate/racing server-side handling and messages; in client mode it also duplicates reject operation bookkeeping.
If `executeGatewayAgent` failed (transient network/auth/server error), the paused `execServerAgentRuntime` op was already marked completed locally by the pre-call `#completeRunningServerOps`. Retries would then see no running server op, miss `#hasRunningServerOp`, and fall through to the non-Gateway client-mode path — while the backend was still paused awaiting human input. Snapshot the paused op IDs before the resume call and retire them only inside the try block after `executeGatewayAgent` resolves. On failure the running marker stays intact so a retry still lands on the Gateway branch and can re-issue the resume. The helper was renamed from `#completeRunningServerOps(context)` to `#completeOpsById(ids)` to reflect the new contract: callers must snapshot beforehand, not re-query at completion time (which would incorrectly match the new resume op too). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
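The snapshot-then-retire contract described in this commit can be sketched roughly as follows. The `OpStore` class, its method names, and the boolean return value are hypothetical stand-ins for this example; only `executeGatewayAgent`, `execServerAgentRuntime`, and the idea of a `completeOpsById(ids)` helper come from the commit message.

```typescript
type OpStatus = 'running' | 'completed';
interface Op { id: string; type: string; status: OpStatus }

// Hypothetical in-memory store standing in for the chat store's operation map.
class OpStore {
  private ops = new Map<string, Op>();

  start(id: string, type: string): void {
    this.ops.set(id, { id, type, status: 'running' });
  }

  runningIds(type: string): string[] {
    return Array.from(this.ops.values())
      .filter((op) => op.type === type && op.status === 'running')
      .map((op) => op.id);
  }

  // Completes exactly the given ids; it never re-queries, so a resume op
  // started in the meantime cannot be matched by accident.
  completeOpsById(ids: string[]): void {
    for (const id of ids) {
      const op = this.ops.get(id);
      if (op) op.status = 'completed';
    }
  }
}

// Assumed shape: resolves to true when the resume call succeeded.
async function resumeViaGateway(
  store: OpStore,
  executeGatewayAgent: () => Promise<void>,
): Promise<boolean> {
  // Snapshot the paused op ids BEFORE issuing the resume call.
  const pausedIds = store.runningIds('execServerAgentRuntime');
  try {
    await executeGatewayAgent();
    // Retire the paused ops only after the resume call has resolved.
    store.completeOpsById(pausedIds);
    return true;
  } catch {
    // On failure the running marker stays intact, so a retry still
    // lands on the Gateway branch.
    return false;
  }
}
```

The point of passing ids rather than a query is that completion acts on what was paused at snapshot time, not on whatever happens to be running when the resume call returns.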
Now that `rejectToolCall` delegates to `chatStore.rejectToolCalling`, the chained `await get().rejectToolCall(...)` inside `rejectAndContinueToolCall` fired a full halting reject before the continue call. In Gateway mode that meant two resume ops on the same tool_call_id (`decision='rejected'` followed by `decision='rejected_continue'`) racing server-side; in client mode it duplicated reject bookkeeping that `chatStore.rejectAndContinueToolCalling` already handles internally. Drop the chained call and fire `onToolRejected` inline so hook semantics are preserved. `chatStore.rejectAndContinueToolCalling` is now the single entry point for both the rejection persist and the continue dispatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
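A minimal sketch of the single-entry-point shape described above, under stated assumptions: `FakeChatStore` and `createToolAction` are hypothetical, the real methods are asynchronous and take more arguments, and the order in which the hook fires relative to the store call is a guess. The sketch only shows that one reject-and-continue click produces one resume op.

```typescript
type Decision = 'rejected' | 'rejected_continue';

interface ResumeOp { toolMessageId: string; decision: Decision }

// Hypothetical stand-in for the chat store: it only records the resume ops
// that would be sent to the Gateway.
class FakeChatStore {
  readonly resumeOps: ResumeOp[] = [];

  rejectToolCalling(toolMessageId: string): void {
    this.resumeOps.push({ toolMessageId, decision: 'rejected' });
  }

  rejectAndContinueToolCalling(toolMessageId: string): void {
    this.resumeOps.push({ toolMessageId, decision: 'rejected_continue' });
  }
}

// Synchronous for brevity; the real actions are async.
function createToolAction(chatStore: FakeChatStore, onToolRejected: (id: string) => void) {
  return {
    rejectToolCall(toolMessageId: string): void {
      onToolRejected(toolMessageId);
      chatStore.rejectToolCalling(toolMessageId);
    },

    rejectAndContinueToolCall(toolMessageId: string): void {
      // No chained rejectToolCall(...) here: that would dispatch a halting
      // 'rejected' resume op before the 'rejected_continue' one.
      onToolRejected(toolMessageId); // hook fired inline instead
      chatStore.rejectAndContinueToolCalling(toolMessageId);
    },
  };
}
```

With this shape, calling `rejectAndContinueToolCall('tool-1')` records exactly one op with `decision: 'rejected_continue'` and fires the hook once.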
🐛 fix(gateway): route approve/reject via lab flag, not transient server op state

After the coordinator fix for `waiting_for_human` (#13860) the paused `execServerAgentRuntime` op is marked `completed` client-side as soon as the server emits `agent_runtime_end`. `startOperation` then runs `cleanupCompletedOperations(30_000)`, which deletes any op completed more than 30 seconds ago, so by the time the user sees the InterventionBar and clicks approve/reject, the running (or recently completed) server op is gone. The previous `#hasRunningServerOp` check therefore kept returning false against a live Gateway backend, flipping approve/reject into the client-mode `internal_execAgentRuntime` branch and stranding the server-side paused conversation.

Switch the helper to `#shouldUseGatewayResume`, which checks the same `isGatewayModeEnabled()` lab flag used to route the initial send. The signal now mirrors how the conversation was dispatched and survives the op-cleanup window.

New regression test exercises the post-coordinator-fix state: the paused `execServerAgentRuntime` op is explicitly `completed` before the approve call runs, and we still expect the Gateway branch to fire with `decision='approved'`.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
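To make the routing problem concrete, here is a small sketch comparing the two signals. The function bodies, the `Op` shape, and the explicit `now` parameter are assumptions for illustration; only the names `hasRunningServerOp`, `shouldUseGatewayResume`, `cleanupCompletedOperations`, `isGatewayModeEnabled`, and the 30-second window come from the commit message.

```typescript
interface Op {
  id: string;
  type: string;
  status: 'running' | 'completed';
  completedAt?: number; // ms timestamp, set when the op completes
}

// Old signal (sketch): depends on a server op still being marked running.
function hasRunningServerOp(ops: Op[]): boolean {
  return ops.some((op) => op.type === 'execServerAgentRuntime' && op.status === 'running');
}

// Sketch of the cleanup: drops ops completed more than maxAgeMs ago.
function cleanupCompletedOperations(ops: Op[], now: number, maxAgeMs: number): Op[] {
  return ops.filter(
    (op) => op.status !== 'completed' || now - (op.completedAt ?? now) <= maxAgeMs,
  );
}

// New signal (sketch): mirrors the flag that routed the initial send.
function shouldUseGatewayResume(isGatewayModeEnabled: () => boolean): boolean {
  return isGatewayModeEnabled();
}

type Branch = 'gateway-resume' | 'client-runtime';

function pickBranch(useGateway: boolean): Branch {
  return useGateway ? 'gateway-resume' : 'client-runtime';
}
```

In this model, an op completed at `t = 0` is both non-running and, after a cleanup at `t = 45_000` with a 30-second window, absent from the list, so the op-based check picks the client branch while the flag-based check still picks the Gateway branch.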
Summary
Follow-up fixes found while verifying LOBE-7152 on canary. In Gateway mode with `userInterventionConfig.approvalMode='ask'`, the paused `execServerAgentRuntime` op was never released after the user interacted with the InterventionBar: the loading spinner kept spinning, and reject-only silently did nothing server-side.

Three changes, each independently load-bearing:

- `ToolAction.rejectToolCall` now delegates to `chatStore.rejectToolCalling`. Previously the reject-only button only mutated local intervention state; the server's paused op never got the `decision='rejected'` signal and waited forever. Symmetric with `rejectAndContinueToolCall`, which already delegates.
- `AgentRuntimeCoordinator` treats `waiting_for_human` as end-of-stream. `request_human_approve` flips state to `waiting_for_human`, so the coordinator now emits `agent_runtime_end` on that transition (renamed the predicate to `hasEnteredStreamEndState` for clarity). The paused state lives on server-side until a resume op arrives; the client's stream for the old operationId closes cleanly via the normal terminal-event path.
- `conversationControl` adds `#completeRunningServerOps` as a client-side fallback. Called before `executeGatewayAgent` in the approve / reject / reject-continue Gateway branches. If the server-side `agent_runtime_end` is delayed or hasn't shipped yet, the client still clears the orphan op before the resume op's events start arriving.

Out of scope

`decision='rejected'` vs `decision='rejected_continue'` currently produce the same server behavior (both feed the rejection to the LLM as `user_input` and let it produce an acknowledgement). Arvin is taking that server-side halt split in a separate PR.

Test plan

- `bunx vitest run 'action.test.ts' 'conversationControl.test.ts' 'AgentRuntimeCoordinator.test.ts'`: 62 tests pass, including new coverage for reject delegation and the `waiting_for_human` stream-end transition
- running=0, no orphan ops pile up across sessions

🤖 Generated with Claude Code