
fix(gateway): allow chat.abort to stop agent RPC runs #71214

Merged
steipete merged 6 commits into openclaw:main from bitloi:fix/issue-71128-agent-rpc-chat-abort on Apr 24, 2026

Conversation

bitloi (Contributor) commented Apr 24, 2026

Summary

  • Problem: chat.abort and sessions.abort return {ok: true, aborted: false, runIds: []} for runs started via the public agent RPC, because only chat.send registers its run into context.chatAbortControllers. External apps that drive the Gateway through agent (to get raw lifecycle/tool/approval events or per-run system prompt injection) have no public way to stop an in-flight run.
  • Why it matters: safe deployments need a reliable public interrupt for autonomous agent runs; the only current workaround is depending on OpenClaw internals.
  • What changed: agent RPC now creates an AbortController, registers it into chatAbortControllers (matching chat.send's entry shape), threads the signal into agentCommandFromIngress, and clears the entry on completion — guarded so it only deletes entries it owns (a rough sketch follows this list).
  • What did NOT change (scope boundary): no protocol changes, no new RPC (reporter's preferred sessions.stop left for a separate design), no change to chat.send / sessions.abort behavior, no change to embedded runner abort plumbing (it already honors abortSignal).
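A minimal sketch of the registration and cleanup described above. The helper names (registerAgentRunAbort, releaseAgentRunAbort) and the exact entry fields are illustrative assumptions, not the literal code in src/gateway/server-methods/agent.ts:

```ts
// Sketch only: field names and helper boundaries are assumptions based on the
// description above, not the actual gateway source.
export type ChatAbortControllerEntry = {
  controller: AbortController;
  sessionKey: string;
  ownerConnId?: string;
  ownerDeviceId?: string;
  expiresAtMs: number;
  kind?: "agent";
};

export function registerAgentRunAbort(
  chatAbortControllers: Map<string, ChatAbortControllerEntry>,
  runId: string,
  resolvedSessionKey: string | undefined,
  owner: { connId?: string; deviceId?: string },
  expiresAtMs: number,
): AbortController | undefined {
  // Skip when chat.send already owns this runId (shared idempotencyKey) or
  // when there is no session key for chat.abort to target.
  if (!resolvedSessionKey || chatAbortControllers.has(runId)) return undefined;

  const controller = new AbortController();
  chatAbortControllers.set(runId, {
    controller,
    sessionKey: resolvedSessionKey,
    ownerConnId: owner.connId,
    ownerDeviceId: owner.deviceId,
    expiresAtMs,
    kind: "agent", // lets chat.send-specific "run active" checks ignore agent runs
  });
  return controller;
}

export function releaseAgentRunAbort(
  chatAbortControllers: Map<string, ChatAbortControllerEntry>,
  runId: string,
  controller: AbortController | undefined,
): void {
  // Only delete the entry this handler created; a concurrent chat.send run
  // that registered the same runId keeps its own controller.
  if (controller && chatAbortControllers.get(runId)?.controller === controller) {
    chatAbortControllers.delete(runId);
  }
}
```

In the handler, registration would happen before the accepted ack, the controller's signal would be threaded into agentCommandFromIngress as ingressOpts.abortSignal, and the release helper would run in the dispatcher's .finally() so success, error, and abort all clear the entry.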

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Fixes openclaw#71128: [Bug]: chat.abort cannot stop active agent runs; only works for chat.send runs

Root Cause (if applicable)

  • Root cause: dispatchAgentRunFromGateway called agentCommandFromIngress without creating an AbortController or registering into context.chatAbortControllers. chat.abort's lookup (chatAbortControllers.get(runId)) therefore always missed for agent-started runs and returned aborted:false, even while the run was streaming.
  • Missing detection / guardrail: no unit test asserted that a run started by the agent RPC could be interrupted by chat.abort; all existing abort coverage went through chat.send.
  • Contributing context: chat.send and agent share the concept of a runId / idempotencyKey, and agent.wait already reads chatAbortControllers.has(runId) to detect cross-RPC collisions — so the registry was intended to be populated by both paths; the agent handler simply never wrote to it.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-methods/agent.test.ts (gateway agent handler chat.abort integration describe block).
  • Scenario the test should lock in: an agent-started run is abortable via chat.abort both by {sessionKey, runId} and by {sessionKey} alone; the resulting AbortSignal fires; the controller entry is removed after success, error, and abort; the registration does not overwrite or evict an entry owned by a concurrent chat.send with the same runId (a reduced sketch of these invariants follows this list).
  • Why this is the smallest reliable guardrail: the bug is at the gateway handler seam; asserting registration + cleanup + signal wiring at the handler level is enough and avoids booting the full embedded runner.
  • Existing test that already covers this: none for the agent path. chat.abort coverage existed only via chat.send.
  • If no new test is added, why not: N/A — 7 new tests added.
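A reduced, vitest-style sketch of two of those invariants, exercised against the hypothetical helpers sketched in the Summary above rather than the real gateway handler (the actual tests in agent.test.ts run at the handler seam):

```ts
import { describe, expect, it } from "vitest";
// Hypothetical module path; these are the illustrative helpers from the Summary sketch.
import {
  registerAgentRunAbort,
  releaseAgentRunAbort,
  type ChatAbortControllerEntry,
} from "./agent-abort-sketch";

describe("agent run abort registration (reduced sketch)", () => {
  it("registers, is abortable via the shared map, and cleans up after itself", () => {
    const map = new Map<string, ChatAbortControllerEntry>();
    const controller = registerAgentRunAbort(map, "run-1", "agent:main", {}, Date.now() + 60_000);

    // chat.abort's lookup path: find the entry by runId and abort its controller.
    map.get("run-1")?.controller.abort();
    expect(controller?.signal.aborted).toBe(true);

    releaseAgentRunAbort(map, "run-1", controller);
    expect(map.has("run-1")).toBe(false);
  });

  it("does not evict a pre-existing chat.send entry with the same runId", () => {
    const map = new Map<string, ChatAbortControllerEntry>();
    const chatSendController = new AbortController();
    map.set("run-1", { controller: chatSendController, sessionKey: "agent:main", expiresAtMs: 0 });

    const agentController = registerAgentRunAbort(map, "run-1", "agent:main", {}, 0);
    expect(agentController).toBeUndefined(); // registration skipped on collision

    releaseAgentRunAbort(map, "run-1", agentController);
    expect(map.get("run-1")?.controller).toBe(chatSendController); // chat.send entry intact
  });
});
```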

User-visible / Behavior Changes

  • chat.abort({sessionKey, runId}) and sessions.abort({key, runId}) now interrupt in-flight runs started by the agent RPC, returning aborted:true, runIds:[runId]. Before, they returned aborted:false, runIds:[].
  • No config, default, or schema changes.

Diagram (if applicable)

Before:
  agent RPC -> runId=X -> dispatch(no abortSignal)
  chat.abort(X)    -> chatAbortControllers.get(X)   -> undefined -> aborted:false

After:
  agent RPC -> runId=X -> register AbortController in chatAbortControllers
                       -> dispatch(ingressOpts.abortSignal = controller.signal)
  chat.abort(X)    -> chatAbortControllers.get(X)   -> entry -> controller.abort()
                   -> run aborts, .finally() clears entry (only if entry.controller === ours)

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No — only adds an interrupt path for runs the caller already started.
  • Data access scope changed? No. chat.abort's existing authorization (canRequesterAbortChatRun, admin-or-matching-connId-or-deviceId) is preserved because the registered entry populates ownerConnId / ownerDeviceId from the originating client, same as chat.send.
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS 15.4 (reporter); reproduced locally against HEAD via unit tests on Linux.
  • Runtime/container: Node 22+, pnpm openclaw gateway.
  • Model/provider: any (reporter: minimax; reproduction is model-agnostic).
  • Integration/channel (if any): N/A — affects all direct Gateway RPC clients.
  • Relevant config (redacted): default gateway with token auth.

Steps

  1. Start a long-running run via agent RPC: {method: "agent", params: {sessionKey, message, idempotencyKey, ...}}.
  2. While the run is streaming, call chat.abort({sessionKey, runId}) or sessions.abort({key: sessionKey, runId}) (payloads sketched after these steps).
  3. Inspect the RPC response and the subsequent agent lifecycle.
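For concreteness, the two calls from steps 1 and 2 as method/params payloads (the sessionKey value is a placeholder and the Gateway RPC envelope around these objects is omitted; the idempotencyKey reuses the value from the test output further below and is assumed here to double as the runId):

```ts
// Step 1: start a long-running run via the public agent RPC.
const startAgentRun = {
  method: "agent",
  params: {
    sessionKey: "agent:main",          // placeholder session key
    message: "run something long enough to abort",
    idempotencyKey: "idem-abort-run",  // assumed to double as the runId below
  },
};

// Step 2: while the run is streaming, either call should now return aborted:true.
const abortByRunId = {
  method: "chat.abort",
  params: { sessionKey: "agent:main", runId: "idem-abort-run" },
};
const abortBySessionOnly = {
  method: "chat.abort",
  params: { sessionKey: "agent:main" }, // no runId: session-scoped abort
};
```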

Expected

  • RPC returns {ok: true, aborted: true, runIds: [runId]} and the agent run terminates.

Actual (before this PR)

  • RPC returns {ok: true, aborted: false, runIds: []}; the run continues.

Evidence

  • Failing test/log before + passing after

Verified by temporarily reverting each of (a) the registration block and (b) the cleanup guard, and confirming the new tests fail, then restoring and confirming they pass. Representative failure on the unfixed source:

FAIL  gateway agent handler chat.abort integration > chat.abort by runId aborts the agent run's signal and removes the entry
  expected: { aborted: true, runIds: ["idem-abort-run"] }
  received: { ok: true, aborted: false, runIds: [] }

After the fix, all 40 tests in agent.test.ts pass, plus the full abort-adjacent suite (106/106 across 5 files).

Human Verification (required)

  • Verified scenarios:
    • agent run → chat.abort({sessionKey, runId}) → signal fires, entry removed, response reports aborted:true.
    • agent run → chat.abort({sessionKey}) with no runId → session-scoped abort finds and stops the run.
    • agent run completes normally → entry removed in .finally().
    • agent run rejects → entry removed in .finally().
    • agent run with no resolvedSessionKey → registration skipped; no map pollution.
    • Pre-existing chat.send entry with same runId → agent registration skipped; agent completion does NOT evict chat.send's entry.
    • Registration happens before the accepted ack (race-safety against clients that abort immediately on seeing the ack).
  • Edge cases checked: owner authorization (ownerConnId / ownerDeviceId populated so canRequesterAbortChatRun respects device scoping); controller-identity guard on cleanup so concurrent chat.send + agent with the same idempotencyKey do not corrupt each other's registration.
  • What I did not verify: live end-to-end abort with a real upstream model call (coverage is at the gateway handler seam; the embedded runner already honors abortSignal via preexisting tests, and I did not re-run the full integration suite against a live provider).

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: runId collision between a concurrent chat.send and agent RPC (same idempotencyKey, same session) could in principle cause double-registration or premature cleanup.
    • Mitigation: registration is skipped when chatAbortControllers.has(runId) is already true, and cleanup only deletes entries whose controller matches the one this handler created (explicit test locks this in).
  • Risk: agent RPC runs without a resolvedSessionKey (e.g., group-only runs) can't be aborted by chat.abort because chat.abort requires a sessionKey.
    • Mitigation: accepted — matches chat.send's contract; the alternative would require a new session-less abort surface, which is out of scope for this fix (reporter's preferred sessions.stop path).

bitloi added 2 commits April 24, 2026 20:11
Register agent-RPC runs into chatAbortControllers and thread the
resulting AbortSignal into agentCommandFromIngress so chat.abort and
sessions.abort can interrupt runs started via the public `agent`
method, matching the behavior chat.send already provides.

Registration happens before the accepted ack so a client that aborts
immediately after seeing the ack cannot race the map entry, and cleanup
runs in the dispatcher's .finally() regardless of success or error.

Fixes openclaw#71128.

Guard dispatchAgentRunFromGateway's completion cleanup so it only
deletes the chatAbortControllers entry it created. Prevents a
concurrent chat.send run with the same runId from losing its
abort controller when the agent RPC run finishes.
openclaw-barnacle (Bot) added the "gateway (Gateway runtime)" and "size: M" labels on Apr 24, 2026

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR fixes chat.abort and sessions.abort not being able to interrupt in-flight runs started via the public agent RPC. It registers an AbortController into chatAbortControllers before sending the accepted ack, threads the signal into agentCommandFromIngress, and cleans it up in a .finally() guarded by controller identity to avoid evicting a concurrent chat.send entry with the same runId.

  • The expiresAtMs stored in the chatAbortControllers entry is capped at 24 hours by resolveChatRunExpiresAtMs (its maxMs default), but the default agent timeout is 48 hours and operators can set timeout=0 for indefinite runs. Before this PR, agent runs had no entry in the map so server maintenance never touched them; after this PR, server-maintenance.ts will call abortChatRunById(…, { stopReason: "timeout" }) on any agent run alive for > 24 hours, silently truncating configured timeouts.

Confidence Score: 4/5

Safe to merge for most deployments; long-running agent runs (> 24h) will be force-aborted by server maintenance earlier than their configured timeout.

The core fix is correct and well-tested. One P1 issue: agent runs registered with expiresAtMs capped at 24h will be cut short by server maintenance even when the configured timeout is 48h or infinite, which is a behavioral regression for long-running autonomous agent runs.

src/gateway/server-methods/agent.ts — the expiresAtMs computation at line 896 needs to be aligned with the actual agent timeout to avoid premature maintenance-driven aborts.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/server-methods/agent.ts
Line: 887-896

Comment:
**Maintenance timeout shorter than agent timeout for long-running runs**

`resolveChatRunExpiresAtMs` hard-caps `expiresAtMs` at `now + 24h` (its `maxMs` default), but `resolveAgentTimeoutMs` defaults to 48 hours and can return up to ~24.8 days for `timeout=0`. Before this PR, agent runs had no entry in `chatAbortControllers`, so the server maintenance loop (`server-maintenance.ts:109`, `abortChatRunById(…, { stopReason: "timeout" })`) never touched them. Now any agent run that has been going for more than 24 hours will be force-aborted by maintenance even if the operator configured a 48-hour or indefinite timeout — a silent regression for long-running autonomous agents. Consider passing `maxMs: timeoutMs + graceMs` to `resolveChatRunExpiresAtMs`, or capping `expiresAtMs` only at `MAX_SAFE_TIMEOUT_MS`, so the maintenance expiry matches the configured agent timeout rather than being silently truncated to 24 hours.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): last reviewed commit: "fix(gateway): scope agent abort cleanup ..."


chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3fe11ab58a


bitloi and others added 3 commits April 24, 2026 20:54
…cise

Add an optional `kind` discriminator to ChatAbortControllerEntry and mark
agent-RPC registrations as `kind: "agent"`. `isChatSendRunActive` in the
event subscription layer and `hasActiveChatRun` in the agent.wait handler
now check `kind !== "agent"` so they keep meaning 'chat.send specifically
is active' — otherwise agent.wait would ignore the agent run in flight and
return a stale chat.send terminal snapshot (gateway-server-chat regression
caught on CI).

resolveChatRunExpiresAtMs defaults to a 24h maxMs cap that made sense
for chat.send but silently truncates long-running agent RPC runs (48h
default, ~24.8d for timeout=0). Pass an explicit maxMs of timeoutMs +
graceMs so the maintenance sweep stops force-aborting long autonomous
agents at 24h.
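A reduced model of the capping behavior this commit changes; this is not resolveChatRunExpiresAtMs itself (its real signature is not shown in this thread), just an illustration of why the explicit maxMs matters:

```ts
const HOUR_MS = 3_600_000;

// Toy stand-in for the expiry computation: the maintenance sweep force-aborts a
// run once Date.now() passes expiresAtMs, so maxMs bounds how long a run may live.
function expiresAt(nowMs: number, requestedMs: number, maxMs = 24 * HOUR_MS): number {
  return nowMs + Math.min(requestedMs, maxMs);
}

const now = Date.now();
const agentTimeoutMs = 48 * HOUR_MS; // default agent timeout per the review thread
const graceMs = 5 * 60_000;          // placeholder grace window

expiresAt(now, agentTimeoutMs);                           // default cap: now + 24h, run aborted early
expiresAt(now, agentTimeoutMs, agentTimeoutMs + graceMs); // explicit maxMs: tracks the 48h timeout
```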
openclaw-barnacle (Bot) added the "app: web-ui" and "size: L" labels and removed "size: M" on Apr 24, 2026
steipete merged commit 8cae2ed into openclaw:main on Apr 24, 2026
67 checks passed
steipete (Contributor) commented

Landed via squash merge onto main.

  • Local focused tests: pnpm test src/gateway/server-methods/agent.test.ts src/gateway/chat-abort.test.ts src/gateway/server-methods/chat.abort-authorization.test.ts src/gateway/server-methods/chat.abort-persistence.test.ts src/gateway/server-maintenance.test.ts (5 files, 62 tests)
  • Local changed gate: conflict markers, core typecheck, and core test typecheck passed; local lint flagged an unrelated, pre-existing doctor lint issue outside this PR's changes
  • CI: green on final PR SHA 22474b77f017fd6e7839e81425dec6ce660fe854
  • Source SHA: 22474b7
  • Merge commit: 8cae2ed

Thanks @bitloi!

bitloi (Author) commented Apr 24, 2026

Thanks for your review.


Labels

app: web-ui · gateway (Gateway runtime) · size: L


Development

Successfully merging this pull request may close these issues.

[Bug]: chat.abort cannot stop active agent runs; only works for chat.send runs

2 participants