Skip to content

🐛 fix(gateway): clean up paused server op after human approve/reject#13860

Merged
arvinxx merged 3 commits into
canaryfrom
fix/lobe-7152-gateway-op-cleanup
Apr 15, 2026
Merged

🐛 fix(gateway): clean up paused server op after human approve/reject#13860
arvinxx merged 3 commits into
canaryfrom
fix/lobe-7152-gateway-op-cleanup

Conversation

@arvinxx

@arvinxx arvinxx commented Apr 15, 2026

Copy link
Copy Markdown
Member

Summary

Follow-up fixes found while verifying LOBE-7152 on canary. In Gateway mode with userInterventionConfig.approvalMode='ask', the paused execServerAgentRuntime op was never released after the user interacted with the InterventionBar — the loading spinner kept spinning, and reject-only silently did nothing server-side.

Three changes, each independently load-bearing:

  • ToolAction.rejectToolCall now delegates to chatStore.rejectToolCalling. Previously the reject-only button only mutated local intervention state; the server's paused op never got the decision='rejected' signal and waited forever. Symmetric with rejectAndContinueToolCall, which already delegates.
  • AgentRuntimeCoordinator treats waiting_for_human as end-of-stream. request_human_approve flips state to waiting_for_human, so the coordinator now emits agent_runtime_end on that transition (renamed the predicate to hasEnteredStreamEndState for clarity). The paused state lives on server-side until a resume op arrives; the client's stream for the old operationId closes cleanly via the normal terminal-event path.
  • conversationControl adds #completeRunningServerOps as a client-side fallback. Called before executeGatewayAgent in the approve / reject / reject-continue Gateway branches. If the server-side agent_runtime_end is delayed or hasn't shipped yet, the client still clears the orphan op before the resume op's events start arriving.

Out of scope

  • decision='rejected' vs decision='rejected_continue' currently produce the same server behavior (both feed the rejection to the LLM as user_input and let it produce an acknowledgement). Arvin is taking that server-side halt split in a separate PR.

Test plan

  • bunx vitest run 'action.test.ts' 'conversationControl.test.ts' 'AgentRuntimeCoordinator.test.ts' — 62 tests pass, including new coverage for reject delegation and the waiting_for_human stream-end transition
  • Manual three-brothers e2e against latest canary (Electron + Gateway mode): approve / reject-only / reject-and-continue all end with running=0, no orphan ops pile up across sessions

🤖 Generated with Claude Code

In Gateway mode with userInterventionConfig.approvalMode='ask', the
paused execServerAgentRuntime op was never released — the loading
spinner kept spinning after the user approved, rejected, or
reject-and-continued, and reject-only silently did nothing on the
server.

- ToolAction.rejectToolCall now delegates to chatStore.rejectToolCalling
  so the Gateway resume op actually fires with decision='rejected';
  previously it only mutated local intervention state and the server's
  paused op waited forever.
- AgentRuntimeCoordinator treats waiting_for_human as end-of-stream so
  the coordinator emits agent_runtime_end when request_human_approve
  flips state, letting the client close the paused op via the normal
  terminal-event path.
- conversationControl adds #completeRunningServerOps as a fallback
  guard in the approve/reject/reject-continue Gateway branches — if
  the server-side signal is delayed or missing, the client still clears
  the orphan op before starting the resume op.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Apr 15, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
lobehub Ready Ready Preview, Comment Apr 15, 2026 5:38pm

Request Review

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @arvinxx, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@codecov

codecov Bot commented Apr 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.76%. Comparing base (2cf65e9) to head (a556e75).
⚠️ Report is 1 commits behind head on canary.

Additional details and impacted files
@@            Coverage Diff             @@
##           canary   #13860      +/-   ##
==========================================
+ Coverage   66.75%   66.76%   +0.01%     
==========================================
  Files        2044     2044              
  Lines      174331   174341      +10     
  Branches    20482    20488       +6     
==========================================
+ Hits       116373   116400      +27     
+ Misses      57834    57817      -17     
  Partials      124      124              
Flag Coverage Δ
app 59.08% <100.00%> (+0.01%) ⬆️
database 92.42% <ø> (ø)
packages/agent-runtime 79.72% <ø> (ø)
packages/context-engine 83.22% <ø> (ø)
packages/conversation-flow 92.36% <ø> (ø)
packages/file-loaders 87.02% <ø> (ø)
packages/memory-user-memory 74.74% <ø> (ø)
packages/model-bank 99.86% <ø> (ø)
packages/model-runtime 84.20% <ø> (ø)
packages/prompts 69.24% <ø> (ø)
packages/python-interpreter 92.90% <ø> (ø)
packages/ssrf-safe-fetch 0.00% <ø> (ø)
packages/utils 90.34% <ø> (ø)
packages/web-crawler 88.66% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Store 66.11% <100.00%> (+0.03%) ⬆️
Services 52.13% <ø> (ø)
Server 66.76% <100.00%> (+<0.01%) ⬆️
Libs 52.89% <ø> (ø)
Utils 91.12% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b9a21fb21

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// `chatStore.rejectToolCalling` does its own tool-message existence guard, so the
// lookup that used to live here is redundant.
const chatStore = useChatStore.getState();
await chatStore.rejectToolCalling(toolMessageId, reason, context);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid double-dispatching reject before reject-continue

rejectAndContinueToolCall still invokes rejectToolCall first, but this change makes rejectToolCall call chatStore.rejectToolCalling directly. That means a single “reject and continue” click now issues a full reject flow before rejectAndContinueToolCalling runs, which in Gateway mode can start two resume ops (rejected then rejected_continue) and cause duplicate/racing server-side handling and messages; in client mode it also duplicates reject operation bookkeeping.

Useful? React with 👍 / 👎.

Comment thread src/store/chat/slices/aiChat/actions/conversationControl.ts Outdated
arvinxx and others added 2 commits April 16, 2026 01:25
If `executeGatewayAgent` failed (transient network/auth/server error),
the paused `execServerAgentRuntime` op was already marked completed
locally by the pre-call `#completeRunningServerOps`. Retries would
then see no running server op, miss `#hasRunningServerOp`, and fall
through to the non-Gateway client-mode path — while the backend was
still paused awaiting human input.

Snapshot the paused op IDs before the resume call and retire them
only inside the try block after `executeGatewayAgent` resolves. On
failure the running marker stays intact so a retry still lands on
the Gateway branch and can re-issue the resume.

The helper was renamed from `#completeRunningServerOps(context)` to
`#completeOpsById(ids)` to reflect the new contract: callers must
snapshot beforehand, not re-query at completion time (which would
incorrectly match the new resume op too).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Now that `rejectToolCall` delegates to `chatStore.rejectToolCalling`,
the chained `await get().rejectToolCall(...)` inside
`rejectAndContinueToolCall` fired a full halting reject before the
continue call. In Gateway mode that meant two resume ops on the same
tool_call_id (`decision='rejected'` followed by
`decision='rejected_continue'`) racing server-side; in client mode it
duplicated reject bookkeeping that `chatStore.rejectAndContinueToolCalling`
already handles internally.

Drop the chained call and fire `onToolRejected` inline so hook
semantics are preserved. `chatStore.rejectAndContinueToolCalling` is
now the single entry point for both the rejection persist and the
continue dispatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@arvinxx arvinxx merged commit 1005f44 into canary Apr 15, 2026
33 of 34 checks passed
@arvinxx arvinxx deleted the fix/lobe-7152-gateway-op-cleanup branch April 15, 2026 17:43
arvinxx added a commit that referenced this pull request Apr 15, 2026
…er op state

After the coordinator fix for `waiting_for_human` (#13860) the paused
`execServerAgentRuntime` op is marked `completed` client-side as soon
as the server emits `agent_runtime_end`. `startOperation` then runs
`cleanupCompletedOperations(30_000)`, which deletes any op completed
more than 30 seconds ago — so by the time the user sees the
InterventionBar and clicks approve/reject, the running (or recently
completed) server op is gone.

The previous `#hasRunningServerOp` check therefore kept returning
false against a live Gateway backend, flipping approve/reject into
the client-mode `internal_execAgentRuntime` branch and stranding the
server-side paused conversation.

Switch the helper to `#shouldUseGatewayResume`, which checks the same
`isGatewayModeEnabled()` lab flag used to route the initial send. The
signal now mirrors how the conversation was dispatched and survives
the op-cleanup window.

New regression test exercises the post-coordinator-fix state: the
paused `execServerAgentRuntime` op is explicitly `completed` before
the approve call runs, and we still expect the Gateway branch to
fire with `decision='approved'`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
arvinxx added a commit that referenced this pull request Apr 16, 2026
🐛 fix(gateway): route approve/reject via lab flag, not transient server op state

After the coordinator fix for `waiting_for_human` (#13860) the paused
`execServerAgentRuntime` op is marked `completed` client-side as soon
as the server emits `agent_runtime_end`. `startOperation` then runs
`cleanupCompletedOperations(30_000)`, which deletes any op completed
more than 30 seconds ago — so by the time the user sees the
InterventionBar and clicks approve/reject, the running (or recently
completed) server op is gone.

The previous `#hasRunningServerOp` check therefore kept returning
false against a live Gateway backend, flipping approve/reject into
the client-mode `internal_execAgentRuntime` branch and stranding the
server-side paused conversation.

Switch the helper to `#shouldUseGatewayResume`, which checks the same
`isGatewayModeEnabled()` lab flag used to route the initial send. The
signal now mirrors how the conversation was dispatched and survives
the op-cleanup window.

New regression test exercises the post-coordinator-fix state: the
paused `execServerAgentRuntime` op is explicitly `completed` before
the approve call runs, and we still expect the Gateway branch to
fire with `decision='approved'`.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant