fix(agent): abort accepted gateway runs on signal#84381
Conversation
|
Codex review: needs real behavior proof before merge. Latest ClawSweeper review: 2026-05-22 08:12 UTC / May 22, 2026, 4:12 AM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. source-reproducible: current main does not pass an abort signal or accepted-run abort hook into the Gateway-backed PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the Gateway/CLI abort fix after current-head SIGTERM proof and maintainer approval of the same-owner/admin abort authorization boundary. Do we have a high-confidence way to reproduce the issue? Yes, source-reproducible: current main does not pass an abort signal or accepted-run abort hook into the Gateway-backed Is this the best way to solve the issue? Yes, with one merge gate: carrying the signal through the CLI request, aborting accepted Gateway runs, and preserving duplicate/pre-registration state is the right boundary. The remaining issue is current-head proof and maintainer acceptance of the abort authorization semantics, not a narrower code defect I can identify. Label changes:
Label justifications:
Acceptance criteria:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 6981051682f2. |
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
48eaaa5 to
003d917
Compare
003d917 to
533728f
Compare
2a30b23 to
b431f57
Compare
36a4dc4 to
45a6f1b
Compare
|
CI ownership note, updated after rebase: I compared the refreshed PR CI run on head
The earlier PR run had 13 failures on base |
|
@clawsweeper ask is this code overly complex? loc is a bit high |
|
🦞👀 I queued a lightweight read-only assist pass. It will post a separate answer comment and will not edit the durable ClawSweeper review comment or trigger close, merge, repair, label, or branch changes. Request: repair |
|
ClawSweeper assist: Yes, LOC is high, but the supplied context suggests most of the size is from tests and edge-case coverage rather than gratuitous production code. The production change is still non-trivial because it crosses CLI signal handling, Gateway request aborts, accepted-run dedupe, and Evidence:
Suggested next action: If you want a real complexity/correctness judgment on whether this can be simplified safely, ask Source: #84381 (comment) |
|
@clawsweeper fix |
|
🦞🔧 Reason: implementation PR creation must be requested on an issue; use autofix or automerge for pull requests. This lane creates PRs only; it does not merge or close the issue. |
|
@clawsweeper repair |
|
ClawSweeper assist: Evidence:
Suggested next action: Comment Source: #84381 (comment) |
|
@clawsweeper fix ci |
|
🦞🔧 Reason: repair commands require a ClawSweeper PR or a PR opted into ClawSweeper autofix or automerge. Supported re-review commands work on open issues and PRs: |
|
ClawSweeper assist: Yes, LOC is high, but from the supplied context it looks mostly driven by tests and race/abort edge cases, not obviously gratuitous complexity. The production change is still inherently broad because it spans CLI signal handling, Gateway request aborts, accepted-run dedupe, duplicate run behavior, and Evidence:
Suggested next action: For a full complexity/correctness judgment on whether this can be simplified safely, ask Source: #84381 (comment) |
|
ClawSweeper assist: Evidence:
Suggested next action: Use Source: #84381 (comment) |
|
🦞👀 I queued a lightweight read-only assist pass. It will post a separate answer comment and will not edit the durable ClawSweeper review comment or trigger close, merge, repair, label, or branch changes. Request: is this code overly complex? loc is a bit high |
|
ClawSweeper assist: Yes, LOC is high, but the supplied context suggests the size is mostly from tests and race-condition coverage, not obviously needless complexity. The production change is still inherently broad because it crosses CLI signal handling, Gateway request aborts, accepted-run dedupe, duplicate-run behavior, and Evidence:
Suggested next action: For a real simplification/correctness review of the current head, ask Source: #84381 (comment) |
|
Verification for landing:
Also added the maintainer changelog entry and a small fixup for the Thanks @Kaspre. |
Fixes #71710.
This makes
openclaw agentsignal handling abort Gateway-backed runs instead of only killing the local CLI process. When the CLI receivesSIGINTorSIGTERM, it now aborts the active Gateway request, asks the Gateway to cancel the accepted agent run, retries that cancellation on the same connection, and falls back to a fresh authorizedchat.abortcall if the original connection closes before cancellation is confirmed.The Gateway side now registers agent abort state before sending the accepted ack, supports pre-registration
chat.abortand/stopraces, preserves pending accepted agent dedupe entries until the actual run timeout, and keeps duplicate retries from spawning a second agent run. The CLI exits with the normal signal status (130/143) after cleanup, including the case where the Gateway run resolves just as a signal arrives.Docs now call out the recommended hard-kill backstop for cron/systemd wrappers while describing the new graceful Gateway abort behavior.
Real behavior proof
Behavior or issue addressed:
openclaw agentGateway runs could leave the remote agent work alive after the local CLI receivedSIGTERM/SIGINT, especially after the Gateway had accepted the run. This patch makes the CLI request cancellation and makes the Gateway able to abort active and pre-registered accepted agent runs.Real environment tested: GCP Crabbox Spot VM in
us-east1-b, Nodev24.15.0, pnpm11.1.0, branch headea5dfbaf324669c917dac22084889a602cca9c2e, merge basec8a953af9371f0c1e5980283abf554f89f641fea.Exact steps or command run after this patch: Remote validation ran
pnpm install --frozen-lockfile --reporter=append-only, a real loopback Gateway/CLI SIGTERM proof script against the source CLI throughtsx, and focused Vitest coverage for the abort paths touched by the current-head CI fix. The earlier runtime-equivalent head also passedpnpm build, a loopback Gateway/CLI/provider SIGTERM proof, focused command/gateway suites, andpnpm check:changed; the follow-up current-head commit only updates test helper shape.Evidence after fix: Terminal output from the remote proof included:
Earlier runtime-head suites passed before the test-helper-only follow-up:
src/commands/agent-via-gateway.test.ts29 passed,src/gateway/server-methods/agent.test.ts112 passed,src/gateway/call.test.tsplussrc/gateway/client.watchdog.test.ts97 passed, andsrc/gateway/server-maintenance.test.ts14 passed.pnpm check:changedpassed all selected lanes, including typecheck, lint shards, and runtime import cycles.Observed result after fix: The Gateway-backed CLI run receives
SIGTERM, confirms the accepted Gateway agent run is aborted, closes the provider request, and exits as a signal termination (143) instead of leaving the provider stream/run alive. The model-override path preserves backend/admin authority for the fallback abort.What was not tested: No live third-party provider was called; the real proof used loopback Gateway/provider surfaces so abort/close behavior was deterministic. The full loopback provider-close proof and
pnpm check:changedwere run on736a39082690389e0b25b6baae089a7f03af5534; the current headea5dfbaf324669c917dac22084889a602cca9c2erefresh reran the source-CLI accepted-run SIGTERM proof and the focused abort tests after rebasing ontoc8a953af9371f0c1e5980283abf554f89f641fea.Review gates