
fix(gateway): allow chat.abort to stop agent RPC runs #71214

Merged
steipete merged 6 commits into openclaw:main from bitloi:fix/issue-71128-agent-rpc-chat-abort on Apr 24, 2026

Conversation

bitloi (Contributor) commented Apr 24, 2026

Summary

  • Problem: chat.abort and sessions.abort return {ok: true, aborted: false, runIds: []} for runs started via the public agent RPC, because only chat.send registers its run into context.chatAbortControllers. External apps that drive the Gateway through agent (to get raw lifecycle/tool/approval events or per-run system prompt injection) have no public way to stop an in-flight run.
  • Why it matters: safe deployments need a reliable public interrupt for autonomous agent runs; the only current workaround is depending on OpenClaw internals.
  • What changed: agent RPC now creates an AbortController, registers it into chatAbortControllers (matching chat.send's entry shape), threads the signal into agentCommandFromIngress, and clears the entry on completion — guarded so it only deletes entries it owns (a rough sketch follows this list).
  • What did NOT change (scope boundary): no protocol changes, no new RPC (reporter's preferred sessions.stop left for a separate design), no change to chat.send / sessions.abort behavior, no change to embedded runner abort plumbing (it already honors abortSignal).
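A minimal sketch of the registration and cleanup described above. The helper names (registerAgentRunAbort, releaseAgentRunAbort) and the exact entry fields are illustrative assumptions, not the literal code in src/gateway/server-methods/agent.ts:

```ts
// Sketch only: field names and helper boundaries are assumptions based on the
// description above, not the actual gateway source.
export type ChatAbortControllerEntry = {
  controller: AbortController;
  sessionKey: string;
  ownerConnId?: string;
  ownerDeviceId?: string;
  expiresAtMs: number;
  kind?: "agent";
};

export function registerAgentRunAbort(
  chatAbortControllers: Map<string, ChatAbortControllerEntry>,
  runId: string,
  resolvedSessionKey: string | undefined,
  owner: { connId?: string; deviceId?: string },
  expiresAtMs: number,
): AbortController | undefined {
  // Skip when chat.send already owns this runId (shared idempotencyKey) or
  // when there is no session key for chat.abort to target.
  if (!resolvedSessionKey || chatAbortControllers.has(runId)) return undefined;

  const controller = new AbortController();
  chatAbortControllers.set(runId, {
    controller,
    sessionKey: resolvedSessionKey,
    ownerConnId: owner.connId,
    ownerDeviceId: owner.deviceId,
    expiresAtMs,
    kind: "agent", // lets chat.send-specific "run active" checks ignore agent runs
  });
  return controller;
}

export function releaseAgentRunAbort(
  chatAbortControllers: Map<string, ChatAbortControllerEntry>,
  runId: string,
  controller: AbortController | undefined,
): void {
  // Only delete the entry this handler created; a concurrent chat.send run
  // that registered the same runId keeps its own controller.
  if (controller && chatAbortControllers.get(runId)?.controller === controller) {
    chatAbortControllers.delete(runId);
  }
}
```

In the handler, registration would happen before the accepted ack, the controller's signal would be threaded into agentCommandFromIngress as ingressOpts.abortSignal, and the release helper would run in the dispatcher's .finally() so success, error, and abort all clear the entry.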

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Fixes openclaw#71128: [Bug]: chat.abort cannot stop active agent runs; only works for chat.send runs

Root Cause (if applicable)

  • Root cause: dispatchAgentRunFromGateway called agentCommandFromIngress without creating an AbortController or registering into context.chatAbortControllers. chat.abort's lookup (chatAbortControllers.get(runId)) therefore always missed for agent-started runs and returned aborted:false, even while the run was streaming.
  • Missing detection / guardrail: no unit test asserted that a run started by the agent RPC could be interrupted by chat.abort; all existing abort coverage went through chat.send.
  • Contributing context: chat.send and agent share the concept of a runId / idempotencyKey, and agent.wait already reads chatAbortControllers.has(runId) to detect cross-RPC collisions — so the registry was intended to be populated by both paths; the agent handler simply never wrote to it.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-methods/agent.test.ts (gateway agent handler chat.abort integration describe block).
  • Scenario the test should lock in: an agent-started run is abortable via chat.abort both by {sessionKey, runId} and by {sessionKey} alone; the resulting AbortSignal fires; the controller entry is removed after success, error, and abort; the registration does not overwrite or evict an entry owned by a concurrent chat.send with the same runId (a reduced sketch of these invariants follows this list).
  • Why this is the smallest reliable guardrail: the bug is at the gateway handler seam; asserting registration + cleanup + signal wiring at the handler level is enough and avoids booting the full embedded runner.
  • Existing test that already covers this: none for the agent path. chat.abort coverage existed only via chat.send.
  • If no new test is added, why not: N/A — 7 new tests added.
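A reduced, vitest-style sketch of two of those invariants, exercised against the hypothetical helpers sketched in the Summary above rather than the real gateway handler (the actual tests in agent.test.ts run at the handler seam):

```ts
import { describe, expect, it } from "vitest";
// Hypothetical module path; these are the illustrative helpers from the Summary sketch.
import {
  registerAgentRunAbort,
  releaseAgentRunAbort,
  type ChatAbortControllerEntry,
} from "./agent-abort-sketch";

describe("agent run abort registration (reduced sketch)", () => {
  it("registers, is abortable via the shared map, and cleans up after itself", () => {
    const map = new Map<string, ChatAbortControllerEntry>();
    const controller = registerAgentRunAbort(map, "run-1", "agent:main", {}, Date.now() + 60_000);

    // chat.abort's lookup path: find the entry by runId and abort its controller.
    map.get("run-1")?.controller.abort();
    expect(controller?.signal.aborted).toBe(true);

    releaseAgentRunAbort(map, "run-1", controller);
    expect(map.has("run-1")).toBe(false);
  });

  it("does not evict a pre-existing chat.send entry with the same runId", () => {
    const map = new Map<string, ChatAbortControllerEntry>();
    const chatSendController = new AbortController();
    map.set("run-1", { controller: chatSendController, sessionKey: "agent:main", expiresAtMs: 0 });

    const agentController = registerAgentRunAbort(map, "run-1", "agent:main", {}, 0);
    expect(agentController).toBeUndefined(); // registration skipped on collision

    releaseAgentRunAbort(map, "run-1", agentController);
    expect(map.get("run-1")?.controller).toBe(chatSendController); // chat.send entry intact
  });
});
```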

User-visible / Behavior Changes

  • chat.abort({sessionKey, runId}) and sessions.abort({key, runId}) now interrupt in-flight runs started by the agent RPC, returning aborted:true, runIds:[runId]. Before, they returned aborted:false, runIds:[].
  • No config, default, or schema changes.

Diagram (if applicable)

Before:
  agent RPC -> runId=X -> dispatch(no abortSignal)
  chat.abort(X)    -> chatAbortControllers.get(X)   -> undefined -> aborted:false

After:
  agent RPC -> runId=X -> register AbortController in chatAbortControllers
                       -> dispatch(ingressOpts.abortSignal = controller.signal)
  chat.abort(X)    -> chatAbortControllers.get(X)   -> entry -> controller.abort()
                   -> run aborts, .finally() clears entry (only if entry.controller === ours)

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No — only adds an interrupt path for runs the caller already started.
  • Data access scope changed? No. chat.abort's existing authorization (canRequesterAbortChatRun, admin-or-matching-connId-or-deviceId) is preserved because the registered entry populates ownerConnId / ownerDeviceId from the originating client, same as chat.send.
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS 15.4 (reporter); reproduced locally against HEAD via unit tests on Linux.
  • Runtime/container: Node 22+, pnpm openclaw gateway.
  • Model/provider: any (reporter: minimax; reproduction is model-agnostic).
  • Integration/channel (if any): N/A — affects all direct Gateway RPC clients.
  • Relevant config (redacted): default gateway with token auth.

Steps

  1. Start a long-running run via agent RPC: {method: "agent", params: {sessionKey, message, idempotencyKey, ...}}.
  2. While the run is streaming, call chat.abort({sessionKey, runId}) or sessions.abort({key: sessionKey, runId}) (payloads sketched after these steps).
  3. Inspect the RPC response and the subsequent agent lifecycle.
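For concreteness, the two calls from steps 1 and 2 as method/params payloads (the sessionKey value is a placeholder and the Gateway RPC envelope around these objects is omitted; the idempotencyKey reuses the value from the test output further below and is assumed here to double as the runId):

```ts
// Step 1: start a long-running run via the public agent RPC.
const startAgentRun = {
  method: "agent",
  params: {
    sessionKey: "agent:main",          // placeholder session key
    message: "run something long enough to abort",
    idempotencyKey: "idem-abort-run",  // assumed to double as the runId below
  },
};

// Step 2: while the run is streaming, either call should now return aborted:true.
const abortByRunId = {
  method: "chat.abort",
  params: { sessionKey: "agent:main", runId: "idem-abort-run" },
};
const abortBySessionOnly = {
  method: "chat.abort",
  params: { sessionKey: "agent:main" }, // no runId: session-scoped abort
};
```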

Expected

  • RPC returns {ok: true, aborted: true, runIds: [runId]} and the agent run terminates.

Actual (before this PR)

  • RPC returns {ok: true, aborted: false, runIds: []}; the run continues.

Evidence

  • Failing test/log before + passing after

Verified by temporarily reverting each of (a) the registration block and (b) the cleanup guard, and confirming the new tests fail, then restoring and confirming they pass. Representative failure on the unfixed source:

FAIL  gateway agent handler chat.abort integration > chat.abort by runId aborts the agent run's signal and removes the entry
  expected: { aborted: true, runIds: ["idem-abort-run"] }
  received: { ok: true, aborted: false, runIds: [] }

After the fix, all 40 tests in agent.test.ts pass, plus the full abort-adjacent suite (106/106 across 5 files).

Human Verification (required)

  • Verified scenarios:
    • agent run → chat.abort({sessionKey, runId}) → signal fires, entry removed, response reports aborted:true.
    • agent run → chat.abort({sessionKey}) with no runId → session-scoped abort finds and stops the run.
    • agent run completes normally → entry removed in .finally().
    • agent run rejects → entry removed in .finally().
    • agent run with no resolvedSessionKey → registration skipped; no map pollution.
    • Pre-existing chat.send entry with same runId → agent registration skipped; agent completion does NOT evict chat.send's entry.
    • Registration happens before the accepted ack (race-safety against clients that abort immediately on seeing the ack).
  • Edge cases checked: owner authorization (ownerConnId / ownerDeviceId populated so canRequesterAbortChatRun respects device scoping); controller-identity guard on cleanup so concurrent chat.send + agent with the same idempotencyKey do not corrupt each other's registration.
  • What I did not verify: live end-to-end abort with a real upstream model call (coverage is at the gateway handler seam; the embedded runner already honors abortSignal via preexisting tests, and I did not re-run the full integration suite against a live provider).

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: runId collision between a concurrent chat.send and agent RPC (same idempotencyKey, same session) could in principle cause double-registration or premature cleanup.
    • Mitigation: registration is skipped when chatAbortControllers.has(runId) is already true, and cleanup only deletes entries whose controller matches the one this handler created (explicit test locks this in).
  • Risk: agent RPC runs without a resolvedSessionKey (e.g., group-only runs) can't be aborted by chat.abort because chat.abort requires a sessionKey.
    • Mitigation: accepted — matches chat.send's contract; the alternative would require a new session-less abort surface, which is out of scope for this fix (reporter's preferred sessions.stop path).

bitloi added 2 commits April 24, 2026 20:11
Register agent-RPC runs into chatAbortControllers and thread the
resulting AbortSignal into agentCommandFromIngress so chat.abort and
sessions.abort can interrupt runs started via the public `agent`
method, matching the behavior chat.send already provides.

Registration happens before the accepted ack so a client that aborts
immediately after seeing the ack cannot race the map entry, and cleanup
runs in the dispatcher's .finally() regardless of success or error.

Fixes openclaw#71128.

Guard dispatchAgentRunFromGateway's completion cleanup so it only
deletes the chatAbortControllers entry it created. Prevents a
concurrent chat.send run with the same runId from losing its
abort controller when the agent RPC run finishes.
openclaw-barnacle (Bot) added the "gateway (Gateway runtime)" and "size: M" labels on Apr 24, 2026

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR fixes chat.abort and sessions.abort not being able to interrupt in-flight runs started via the public agent RPC. It registers an AbortController into chatAbortControllers before sending the accepted ack, threads the signal into agentCommandFromIngress, and cleans it up in a .finally() guarded by controller identity to avoid evicting a concurrent chat.send entry with the same runId.

  • The expiresAtMs stored in the chatAbortControllers entry is capped at 24 hours by resolveChatRunExpiresAtMs (its maxMs default), but the default agent timeout is 48 hours and operators can set timeout=0 for indefinite runs. Before this PR, agent runs had no entry in the map so server maintenance never touched them; after this PR, server-maintenance.ts will call abortChatRunById(…, { stopReason: "timeout" }) on any agent run alive for > 24 hours, silently truncating configured timeouts.

Confidence Score: 4/5

Safe to merge for most deployments; long-running agent runs (> 24h) will be force-aborted by server maintenance earlier than their configured timeout.

The core fix is correct and well-tested. One P1 issue: agent runs registered with expiresAtMs capped at 24h will be cut short by server maintenance even when the configured timeout is 48h or infinite, which is a behavioral regression for long-running autonomous agent runs.

src/gateway/server-methods/agent.ts — the expiresAtMs computation at line 896 needs to be aligned with the actual agent timeout to avoid premature maintenance-driven aborts.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/server-methods/agent.ts
Line: 887-896

Comment:
**Maintenance timeout shorter than agent timeout for long-running runs**

`resolveChatRunExpiresAtMs` hard-caps `expiresAtMs` at `now + 24h` (its `maxMs` default), but `resolveAgentTimeoutMs` defaults to 48 hours and can return up to ~24.8 days for `timeout=0`. Before this PR, agent runs had no entry in `chatAbortControllers`, so the server maintenance loop (`server-maintenance.ts:109`, `abortChatRunById(…, { stopReason: "timeout" })`) never touched them. Now any agent run that has been going for more than 24 hours will be force-aborted by maintenance even if the operator configured a 48-hour or indefinite timeout — a silent regression for long-running autonomous agents. Consider passing `maxMs: timeoutMs + graceMs` to `resolveChatRunExpiresAtMs`, or capping `expiresAtMs` only at `MAX_SAFE_TIMEOUT_MS`, so the maintenance expiry matches the configured agent timeout rather than being silently truncated to 24 hours.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): last reviewed commit: "fix(gateway): scope agent abort cleanup ..."


chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3fe11ab58a


bitloi and others added 3 commits April 24, 2026 20:54
…cise

Add an optional `kind` discriminator to ChatAbortControllerEntry and mark
agent-RPC registrations as `kind: "agent"`. `isChatSendRunActive` in the
event subscription layer and `hasActiveChatRun` in the agent.wait handler
now check `kind !== "agent"` so they keep meaning 'chat.send specifically
is active' — otherwise agent.wait would ignore the agent run in flight and
return a stale chat.send terminal snapshot (gateway-server-chat regression
caught on CI).

resolveChatRunExpiresAtMs defaults to a 24h maxMs cap that made sense
for chat.send but silently truncates long-running agent RPC runs (48h
default, ~24.8d for timeout=0). Pass an explicit maxMs of timeoutMs +
graceMs so the maintenance sweep stops force-aborting long autonomous
agents at 24h.
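A reduced model of the capping behavior this commit changes; this is not resolveChatRunExpiresAtMs itself (its real signature is not shown in this thread), just an illustration of why the explicit maxMs matters:

```ts
const HOUR_MS = 3_600_000;

// Toy stand-in for the expiry computation: the maintenance sweep force-aborts a
// run once Date.now() passes expiresAtMs, so maxMs bounds how long a run may live.
function expiresAt(nowMs: number, requestedMs: number, maxMs = 24 * HOUR_MS): number {
  return nowMs + Math.min(requestedMs, maxMs);
}

const now = Date.now();
const agentTimeoutMs = 48 * HOUR_MS; // default agent timeout per the review thread
const graceMs = 5 * 60_000;          // placeholder grace window

expiresAt(now, agentTimeoutMs);                           // default cap: now + 24h, run aborted early
expiresAt(now, agentTimeoutMs, agentTimeoutMs + graceMs); // explicit maxMs: tracks the 48h timeout
```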
openclaw-barnacle (Bot) added the "app: web-ui" and "size: L" labels and removed "size: M" on Apr 24, 2026
steipete merged commit 8cae2ed into openclaw:main on Apr 24, 2026
67 checks passed
steipete (Contributor) commented

Landed via squash merge onto main.

  • Local focused tests: pnpm test src/gateway/server-methods/agent.test.ts src/gateway/chat-abort.test.ts src/gateway/server-methods/chat.abort-authorization.test.ts src/gateway/server-methods/chat.abort-persistence.test.ts src/gateway/server-maintenance.test.ts (5 files, 62 tests)
  • Local changed gate: conflict markers, core typecheck, and core test typecheck passed; local lint flagged an unrelated, pre-existing doctor lint issue outside this PR's changes
  • CI: green on final PR SHA 22474b77f017fd6e7839e81425dec6ce660fe854
  • Source SHA: 22474b7
  • Merge commit: 8cae2ed

Thanks @bitloi!

bitloi (Author) commented Apr 24, 2026

Thanks for your review.


Labels

app: web-ui · gateway (Gateway runtime) · size: L


Development

Successfully merging this pull request may close these issues.

[Bug]: chat.abort cannot stop active agent runs; only works for chat.send runs

2 participants