Problem
When a user sends /stop or a new message while the agent is mid-run (executing tools, streaming, running subprocesses), the current run may continue producing side effects: launching more tools, emitting stale progress updates, and finishing subprocess chains.
This creates a real UX and safety issue: the user loses confidence in their ability to regain control.
Observed behavior
/stop cuts some parts of the system but not all:
- Child processes may keep running after chat is aborted
- A tool batch may continue with subsequent tools after partial cancellation
- Stale progress/typing/messages keep arriving after stop
- New user messages may not immediately supersede the active run
Expected behavior
If the user sends a new message or /stop, the previous run should stop producing effects immediately. The new message should become the dominant instruction.
Existing primitives (they're good!)
OpenClaw already has solid abort primitives scattered across subsystems:
- chat.abort RPC handler
- abortEmbeddedPiRun() for embedded agent runs
- clearSessionQueues() for queue cleanup
- managedRun.cancel("manual-cancel") for exec processes
- cancel(runId) / cancelScope(scopeKey) in the process supervisor
- replyRunRegistry.abort() for reply run tracking
- abortedLastRun flag in session store
- handlerGeneration invalidation pattern in heartbeat-wake
These are all good building blocks. The issue is not missing cancellation, but missing coordination between them.
What's missing: unified run invalidation
The gap is a single coherent guarantee that:
- A new user message or /stop invalidates the active run
- The invalidated run cannot produce new side effects (messages, tool calls, progress, typing)
- Subprocesses owned by the invalidated run are cancelled
- Pending tool calls in the invalidated run are skipped
- The new user message becomes the dominant instruction immediately
Proposed approach
Introduce stronger run-scoped interruption semantics, inspired by patterns from Hermes (which implements a well-tested version of this):
1. Run generation counter per session
A simple incrementing counter. When an abort or a new message arrives, increment the generation. Every downstream check then validates that the generation it captured at run start is still current.
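To make the idea concrete, here is a minimal sketch of a per-session counter. The names RunGeneration and generationFor are hypothetical and do not exist in OpenClaw today; the only assumption is that sessions can be keyed by a string.

```typescript
// Hypothetical sketch: none of these names are existing OpenClaw APIs.
class RunGeneration {
  private current = 0;

  /** Capture the generation at the start of a run. */
  capture(): number {
    return this.current;
  }

  /** Called on /stop or a new user message; invalidates all earlier captures. */
  invalidate(): number {
    this.current += 1;
    return this.current;
  }

  /** True only if no invalidation happened since `captured` was taken. */
  isCurrent(captured: number): boolean {
    return captured === this.current;
  }
}

// One counter per session, keyed by a session identifier (assumed to be a string).
const generations = new Map<string, RunGeneration>();

function generationFor(sessionKey: string): RunGeneration {
  let gen = generations.get(sessionKey);
  if (!gen) {
    gen = new RunGeneration();
    generations.set(sessionKey, gen);
  }
  return gen;
}
```

A run would call capture() once when it starts and pass the captured value (or a closure over isCurrent) to every later checkpoint.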
2. Pre-tool gate
Before each tool execution, check if the run's generation is still current. If not, skip the tool and return a cancelled result. (Hermes tests this explicitly in test_all_tools_skipped_when_interrupted.)
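A rough sketch of what such a gate could look like in a sequential tool batch. The ToolCall and ToolResult shapes and the runToolBatch function are illustrative assumptions, not OpenClaw's actual tool runner; isCurrent is assumed to close over the generation captured at run start.

```typescript
// Hypothetical shapes for illustration only.
type ToolCall = { name: string; run: () => Promise<string> };
type ToolResult = { name: string; status: "ok" | "cancelled"; output?: string };

async function runToolBatch(
  calls: ToolCall[],
  isCurrent: () => boolean,
): Promise<ToolResult[]> {
  const results: ToolResult[] = [];
  for (const call of calls) {
    // Gate: re-check before every tool, not just once per batch.
    if (!isCurrent()) {
      results.push({ name: call.name, status: "cancelled" });
      continue;
    }
    const output = await call.run();
    results.push({ name: call.name, status: "ok", output });
  }
  return results;
}
```

Note that this sketch only prevents later tools from starting. A tool that is already running when the generation changes still needs its own cancellation path (see the subprocess item below).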
3. Stale output fence
Prevent stale runs from emitting visible effects. Before emitting streaming deltas, typing indicators, progress updates, or final messages: check generation. The pattern already exists in heartbeat-wake.ts — apply it to the reply pipeline.
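One possible shape for the fence is a thin wrapper around whatever function actually delivers output to the user. The OutboundEvent type and fenceEmitter below are hypothetical; how the real reply pipeline emits events is not specified here.

```typescript
// Hypothetical event shape; the real reply pipeline may differ.
type OutboundEvent =
  | { kind: "delta"; text: string }
  | { kind: "typing" }
  | { kind: "progress"; text: string }
  | { kind: "final"; text: string };

/**
 * Wraps an emitter so that events from an invalidated run are dropped.
 * Returns true if the event was delivered, false if it was fenced off.
 */
function fenceEmitter(
  emit: (event: OutboundEvent) => void,
  isCurrent: () => boolean,
): (event: OutboundEvent) => boolean {
  return (event) => {
    if (!isCurrent()) return false; // stale run: drop silently
    emit(event);
    return true;
  };
}
```

Putting the check at the single outbound choke point, rather than at each call site, means a stale run cannot bypass it by accident.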
4. Stronger subprocess cancellation
Wire exec/supervisor processes to session run scope. On generation change, cancel associated processes.
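As a sketch of the wiring, processes can be registered under the generation of the run that owns them and aborted in bulk when the generation moves on. RunProcessScope is a hypothetical name. It uses the standard AbortController; in a real integration the abort listener would presumably call the existing cancel(runId) / cancelScope(scopeKey) primitives, or the signal could be handed to Node's child_process.spawn via its signal option. Both of those are assumptions about the integration, not a description of current code.

```typescript
// Hypothetical registry; not an existing OpenClaw class.
class RunProcessScope {
  private readonly byGeneration = new Map<number, AbortController[]>();

  /** Register a process under the generation of the run that owns it. */
  register(generation: number): AbortSignal {
    const controller = new AbortController();
    const owned = this.byGeneration.get(generation) ?? [];
    owned.push(controller);
    this.byGeneration.set(generation, owned);
    return controller.signal;
  }

  /** Abort every process owned by a generation older than the current one. */
  cancelStale(currentGeneration: number): number {
    let cancelled = 0;
    this.byGeneration.forEach((owned, generation) => {
      if (generation >= currentGeneration) return;
      for (const controller of owned) {
        controller.abort();
        cancelled += 1;
      }
      this.byGeneration.delete(generation);
    });
    return cancelled;
  }
}
```

Escalation for processes that ignore the first signal (SIGTERM followed by SIGKILL, as in the Hermes prior art) would sit behind the abort listener and is left out of this sketch.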
5. New message takeover
When a new user message arrives during an active run: increment generation → cancel active processes → abort embedded run → clear queues → new message becomes next input.
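The ordering above can be sketched as a single function. TakeoverHooks and takeOver are hypothetical; each hook stands in for one of the existing primitives listed earlier (for example abortEmbeddedPiRun() and clearSessionQueues()), whose real signatures are not assumed here.

```typescript
// Hypothetical hooks standing in for the existing abort primitives.
interface TakeoverHooks {
  incrementGeneration(): number;
  cancelProcesses(staleBefore: number): void;
  abortEmbeddedRun(): void;
  clearQueues(): void;
  enqueueInput(message: string): void;
}

/** Runs the takeover steps in a fixed order and returns the new generation. */
function takeOver(hooks: TakeoverHooks, message: string): number {
  // Increment first so the stale run is fenced before anything else happens.
  const generation = hooks.incrementGeneration();
  hooks.cancelProcesses(generation);
  hooks.abortEmbeddedRun();
  hooks.clearQueues();
  // Only now does the new message become the next input.
  hooks.enqueueInput(message);
  return generation;
}
```

Incrementing the generation before the other steps matters: any output the old run produces while it is being torn down is already stale and gets dropped by the fence.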
Prior art
Hermes agent demonstrates these patterns with good test coverage:
- Thread-scoped interrupt signaling (tools/interrupt.py)
- Pre-tool interrupt checks with test coverage
- Gateway run generation invalidation for stale outputs
- SIGTERM→SIGKILL escalation for resistant processes
- Pending message queue drain and combination
The goal is not a line-by-line port, but adapting these concepts to OpenClaw's async architecture.
Benefits
- Safer production behavior (tool chains stop reliably)
- Stronger user control and trust
- More predictable /stop semantics
- Fewer stale messages after abort
- Foundation for safer autonomous operation
I'm willing to contribute a PR
I have a prototype implementation plan and would be happy to contribute a PR if the maintainers are interested. The implementation is designed to layer on top of existing primitives without breaking current behavior.
This would be AI-assisted (Claude Code) with testing.