fix(gateway): allow chat.abort to stop agent RPC runs #71214
steipete merged 6 commits into openclaw:main
Conversation
Register agent-RPC runs into chatAbortControllers and thread the resulting AbortSignal into agentCommandFromIngress so chat.abort and sessions.abort can interrupt runs started via the public `agent` method, matching the behavior chat.send already provides. Registration happens before the accepted ack so a client that aborts immediately after seeing the ack cannot race the map entry, and cleanup runs in the dispatcher's .finally() regardless of success or error. Fixes openclaw#71128.
Guard dispatchAgentRunFromGateway's completion cleanup so it only deletes the chatAbortControllers entry it created. Prevents a concurrent chat.send run with the same runId from losing its abort controller when the agent RPC run finishes.
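The register-before-ack and guarded-cleanup behavior described above can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: `chatAbortControllers` is the map the PR references, while `AbortEntry` and `dispatchAgentRun` are hypothetical stand-ins.

```typescript
// Minimal sketch of the register-then-guarded-cleanup pattern.
type AbortEntry = { controller: AbortController; sessionKey: string };

const chatAbortControllers = new Map<string, AbortEntry>();

async function dispatchAgentRun(
  runId: string,
  sessionKey: string,
  run: (signal: AbortSignal) => Promise<void>,
): Promise<void> {
  // Register before the accepted ack is sent, so a client that aborts
  // immediately after seeing the ack cannot race the map entry.
  const controller = new AbortController();
  const registered = !chatAbortControllers.has(runId);
  if (registered) chatAbortControllers.set(runId, { controller, sessionKey });

  return run(controller.signal).finally(() => {
    // Guarded cleanup: delete only the entry this dispatch created, so a
    // concurrent chat.send run with the same runId keeps its controller.
    const entry = chatAbortControllers.get(runId);
    if (registered && entry?.controller === controller) {
      chatAbortControllers.delete(runId);
    }
  });
}
```

The identity check on `controller` (rather than deleting unconditionally by `runId`) is what prevents the agent run's completion from evicting an entry owned by a concurrent `chat.send`.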
Greptile Summary
This PR fixes
Confidence Score: 4/5 — Safe to merge for most deployments; long-running agent runs (> 24h) will be force-aborted by server maintenance earlier than their configured timeout. The core fix is correct and well-tested. One P1 issue: agent runs registered with expiresAtMs capped at 24h will be cut short by server maintenance even when the configured timeout is 48h or infinite, which is a behavioral regression for long-running autonomous agent runs. src/gateway/server-methods/agent.ts — the expiresAtMs computation at line 896 needs to be aligned with the actual agent timeout to avoid premature maintenance-driven aborts.
This is a comment left during a code review.
Path: src/gateway/server-methods/agent.ts
Line: 887-896
Comment:
**Maintenance timeout shorter than agent timeout for long-running runs**
`resolveChatRunExpiresAtMs` hard-caps `expiresAtMs` at `now + 24h` (its `maxMs` default), but `resolveAgentTimeoutMs` defaults to 48 hours and can return up to ~24.8 days for `timeout=0`. Before this PR, agent runs had no entry in `chatAbortControllers`, so the server maintenance loop (`server-maintenance.ts:109`, `abortChatRunById(…, { stopReason: "timeout" })`) never touched them. Now any agent run that has been going for more than 24 hours will be force-aborted by maintenance even if the operator configured a 48-hour or indefinite timeout — a silent regression for long-running autonomous agents. Consider passing `maxMs: timeoutMs + graceMs` to `resolveChatRunExpiresAtMs`, or capping `expiresAtMs` only at `MAX_SAFE_TIMEOUT_MS`, so the maintenance expiry matches the configured agent timeout rather than being silently truncated to 24 hours.
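The suggested fix can be sketched as below. This is an illustration of the review comment's proposal, not the repo's actual code: the real `resolveChatRunExpiresAtMs` signature may differ, and `agentRunExpiresAtMs` is a hypothetical helper.

```typescript
// Sketch: derive the maintenance expiry from the configured agent timeout
// rather than the 24h default that made sense for chat.send.
const HOUR_MS = 60 * 60 * 1000;
const MAX_SAFE_TIMEOUT_MS = 2 ** 31 - 1; // ~24.8 days, setTimeout's cap

function resolveChatRunExpiresAtMs(
  now: number,
  requestedMs: number,
  maxMs: number = 24 * HOUR_MS, // default cap that silently truncated agent runs
): number {
  return now + Math.min(requestedMs, maxMs);
}

function agentRunExpiresAtMs(now: number, timeoutMs: number, graceMs: number): number {
  // Pass an explicit maxMs so the maintenance expiry tracks the agent timeout.
  const budget = Math.min(timeoutMs + graceMs, MAX_SAFE_TIMEOUT_MS);
  return resolveChatRunExpiresAtMs(now, budget, budget);
}
```

With the default cap, a 48h run expires at 24h; with the explicit `maxMs`, it expires at the configured timeout plus grace.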
How can I resolve this? If you propose a fix, please make it concise.
Reviews (1): Last reviewed commit: "fix(gateway): scope agent abort cleanup ..."
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3fe11ab58a
Add an optional `kind` discriminator to ChatAbortControllerEntry and mark agent-RPC registrations as `kind: "agent"`. `isChatSendRunActive` in the event subscription layer and `hasActiveChatRun` in the agent.wait handler now check `kind !== "agent"` so they keep meaning "chat.send specifically is active" — otherwise agent.wait would ignore the agent run in flight and return a stale chat.send terminal snapshot (gateway-server-chat regression caught on CI).
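The discriminator described above can be sketched as follows. The entry shape and map name are illustrative; only the `kind !== "agent"` check and the function names mirror the commit description.

```typescript
// Agent-RPC registrations are tagged so chat.send-specific checks skip them.
type ChatAbortControllerEntry = {
  controller: AbortController;
  kind?: "agent"; // absent for chat.send registrations
};

const controllers = new Map<string, ChatAbortControllerEntry>();

// "chat.send specifically is active": agent-RPC entries do not count.
function isChatSendRunActive(runId: string): boolean {
  const entry = controllers.get(runId);
  return entry !== undefined && entry.kind !== "agent";
}

// Any entry — agent or chat.send — means some run is still in flight.
function hasAnyActiveRun(runId: string): boolean {
  return controllers.has(runId);
}
```

Keeping the discriminator optional means existing chat.send registration sites need no changes: an entry without `kind` behaves exactly as before.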
resolveChatRunExpiresAtMs defaults to a 24h maxMs cap that made sense for chat.send but silently truncates long-running agent RPC runs (48h default, ~24.8d for timeout=0). Pass an explicit maxMs of timeoutMs + graceMs so the maintenance sweep stops force-aborting long autonomous agents at 24h.
Landed via squash merge onto main.
Thanks @bitloi!
Thanks for your review. |
Summary
`chat.abort` and `sessions.abort` return `{ok: true, aborted: false, runIds: []}` for runs started via the public `agent` RPC, because only `chat.send` registers its run into `context.chatAbortControllers`. External apps that drive the Gateway through `agent` (to get raw lifecycle/tool/approval events or per-run system prompt injection) have no public way to stop an in-flight run. The `agent` RPC now creates an `AbortController`, registers it into `chatAbortControllers` (matching `chat.send`'s entry shape), threads the signal into `agentCommandFromIngress`, and clears the entry on completion — guarded so it only deletes entries it owns. `sessions.stop` is left for a separate design; no change to `chat.send`/`sessions.abort` behavior, and no change to embedded runner abort plumbing (it already honors `abortSignal`).
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause (if applicable)
- `dispatchAgentRunFromGateway` called `agentCommandFromIngress` without creating an `AbortController` or registering into `context.chatAbortControllers`.
- `chat.abort`'s lookup (`chatAbortControllers.get(runId)`) therefore always missed for agent-started runs and returned `aborted: false`, even while the run was streaming.
- No run started via the `agent` RPC could be interrupted by `chat.abort`; all existing abort coverage went through `chat.send`.
- `chat.send` and `agent` share the concept of a `runId`/`idempotencyKey`, and `agent.wait` already reads `chatAbortControllers.has(runId)` to detect cross-RPC collisions — so the registry was intended to be populated by both paths; the `agent` handler simply never wrote to it.
Regression Test Plan (if applicable)
- New tests in `src/gateway/server-methods/agent.test.ts` (`gateway agent handler chat.abort integration` describe block).
- Coverage: an agent-started run is abortable via `chat.abort` both by `{sessionKey, runId}` and by `{sessionKey}` alone; the resulting `AbortSignal` fires; the controller entry is removed after success, error, and abort; the registration does not overwrite or evict an entry owned by a concurrent `chat.send` with the same runId.
- Previously, no tests exercised abort on the `agent` path; `chat.abort` coverage existed only via `chat.send`.
User-visible / Behavior Changes
`chat.abort({sessionKey, runId})` and `sessions.abort({key, runId})` now interrupt in-flight runs started by the `agent` RPC, returning `aborted: true, runIds: [runId]`. Before, they returned `aborted: false, runIds: []`.
Diagram (if applicable)
Security Impact (required)
- No
- No
- No
- No — only adds an interrupt path for runs the caller already started.
- No — `chat.abort`'s existing authorization (`canRequesterAbortChatRun`, admin-or-matching-connId-or-deviceId) is preserved because the registered entry populates `ownerConnId`/`ownerDeviceId` from the originating client, same as `chat.send`.
- Yes, explain risk + mitigation: N/A
Repro + Verification
Environment
- Gateway run via `pnpm openclaw gateway`.
- Model: `minimax` (reproduction is model-agnostic).
Steps
1. Start a run via the `agent` RPC: `{method: "agent", params: {sessionKey, message, idempotencyKey, ...}}`.
2. While the run is streaming, call `chat.abort({sessionKey, runId})` or `sessions.abort({key: sessionKey, runId})`.
Expected
`{ok: true, aborted: true, runIds: [runId]}` and the agent run terminates.
Actual (before this PR)
`{ok: true, aborted: false, runIds: []}`; the run continues.
Evidence
Verified by temporarily reverting each of (a) the registration block and (b) the cleanup guard, and confirming the new tests fail, then restoring and confirming they pass. Representative failure on the unfixed source:
After the fix, all 40 tests in `agent.test.ts` pass, plus the full abort-adjacent suite (106/106 across 5 files).
Human Verification (required)
- `agent` run → `chat.abort({sessionKey, runId})` → signal fires, entry removed, response reports `aborted: true`.
- `agent` run → `chat.abort({sessionKey})` with no runId → session-scoped abort finds and stops the run.
- `agent` run completes normally → entry removed in `.finally()`.
- `agent` run rejects → entry removed in `.finally()`.
- `agent` run with no `resolvedSessionKey` → registration skipped; no map pollution.
- Pre-existing `chat.send` entry with the same `runId` → agent registration skipped; agent completion does NOT evict chat.send's entry.
- `ownerConnId`/`ownerDeviceId` populated so `canRequesterAbortChatRun` respects device scoping; controller-identity guard on cleanup so concurrent chat.send + agent with the same idempotencyKey do not corrupt each other's registration.
- Not hand-verified: the embedded runner honoring `abortSignal` (covered by preexisting tests), and I did not re-run the full integration suite against a live provider.
Review Conversations
Compatibility / Migration
- Yes
- No
- No
Risks and Mitigations
- Risk: a concurrent `chat.send` and `agent` RPC (same `idempotencyKey`, same session) could in principle cause double-registration or premature cleanup. Mitigation: registration is skipped when `chatAbortControllers.has(runId)` is already true, and cleanup only deletes entries whose `controller` matches the one this handler created (explicit test locks this in).
- Risk: runs with no `resolvedSessionKey` (e.g., group-only runs) can't be aborted by `chat.abort`, because chat.abort requires a sessionKey. Mitigation: this matches `chat.send`'s contract; the alternative would require a new session-less abort surface, which is out of scope for this fix (reporter's preferred `sessions.stop` path).
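The first mitigation can be illustrated as a small standalone scenario. This is a test-style sketch of the skip-registration plus controller-identity guard, not the repo's actual test file; all names here are illustrative.

```typescript
// Scenario: chat.send and the agent handler race on the same runId.
type Entry = { controller: AbortController };
const map = new Map<string, Entry>();

// chat.send registers first and owns the entry.
const chatSendController = new AbortController();
map.set("run-1", { controller: chatSendController });

// The agent handler sees an existing entry and skips registration.
const agentController = new AbortController();
const agentRegistered = !map.has("run-1"); // false: skip

// Agent completion cleanup: delete only an entry this handler created.
const entry = map.get("run-1");
if (agentRegistered && entry?.controller === agentController) {
  map.delete("run-1");
}
```

After the agent run finishes, `chat.send`'s entry is still in the map and still abortable — the exact invariant the PR's explicit test locks in.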