Stronger run interruptibility: unified generation invalidation and stale-output fencing

## Problem

When a user sends `/stop` or a new message while the agent is mid-run (executing tools, streaming, running subprocesses), the current run may continue producing side effects: launching more tools, emitting stale progress updates, and finishing subprocess chains.

This creates a real UX and safety issue: the user loses confidence in their ability to regain control.

### Observed behavior
- `/stop` halts some parts of the system but not all
- Child processes may keep running after the chat is aborted
- A tool batch may continue with subsequent tools after partial cancellation
- Stale progress/typing/messages keep arriving after stop
- New user messages may not immediately supersede the active run

### Expected behavior
If the user sends a new message or `/stop`, the previous run should stop producing effects immediately. The new message should become the dominant instruction.

## Existing primitives (they're good!)

OpenClaw already has solid abort primitives scattered across subsystems:

- `chat.abort` RPC handler
- `abortEmbeddedPiRun()` for embedded agent runs
- `clearSessionQueues()` for queue cleanup
- `managedRun.cancel("manual-cancel")` for exec processes
- `cancel(runId)` / `cancelScope(scopeKey)` in the process supervisor
- `replyRunRegistry.abort()` for reply run tracking
- `abortedLastRun` flag in session store
- `handlerGeneration` invalidation pattern in heartbeat-wake

These are all good building blocks. The issue is not missing cancellation, but missing **coordination** between them.

## What's missing: unified run invalidation

The gap is a single coherent guarantee that:

1. A new user message or `/stop` **invalidates** the active run
2. The invalidated run **cannot** produce new side effects (messages, tool calls, progress, typing)
3. Subprocesses owned by the invalidated run are **cancelled**
4. Pending tool calls in the invalidated run are **skipped**
5. The new user message becomes the **dominant** instruction immediately

## Proposed approach

Introduce stronger run-scoped interruption semantics, inspired by patterns from [Hermes](https://github.com/NousResearch/hermes-agent) (which implements a well-tested version of this):

### 1. Run generation counter per session
A simple incrementing counter. When an abort or a new message arrives, increment the session's generation. Each run captures the generation it started with, and every downstream check validates that the captured value is still current.
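A minimal sketch of such a counter, to make the capture-and-compare idea concrete. `RunGenerations`, its method names, and the `"session-a"` key are hypothetical and not existing OpenClaw APIs; sessions are assumed to start at generation 0.

```typescript
// Hypothetical helper for illustration; not an existing OpenClaw API.
class RunGenerations {
  private readonly current = new Map<string, number>();

  /** Read the generation a run should carry for its whole lifetime. */
  capture(sessionKey: string): number {
    return this.current.get(sessionKey) ?? 0;
  }

  /** Called on `/stop` or a new user message; returns the new generation. */
  invalidate(sessionKey: string): number {
    const next = this.capture(sessionKey) + 1;
    this.current.set(sessionKey, next);
    return next;
  }

  /** True while no invalidation has happened since `captured` was read. */
  isCurrent(sessionKey: string, captured: number): boolean {
    return this.capture(sessionKey) === captured;
  }
}

const generations = new RunGenerations();
const runGen = generations.capture("session-a"); // a run starts
generations.invalidate("session-a"); // the user sends /stop
const stillValid = generations.isCurrent("session-a", runGen); // false
```

Because the check is a plain integer comparison, it is cheap enough to call before every tool, emit, and spawn.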

### 2. Pre-tool gate
Before each tool execution, check if the run's generation is still current. If not, skip the tool and return a cancelled result. (Hermes tests this explicitly in `test_all_tools_skipped_when_interrupted`.)
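The gate could look roughly like the sketch below. `ToolCall`, `ToolResult`, and `runToolBatch` are hypothetical shapes, and tools are synchronous here only to keep the example short; real tools would be async, with the same check placed before each `await`.

```typescript
// Hypothetical shapes for illustration only.
type ToolCall = { name: string; run: () => string };
type ToolResult =
  | { name: string; status: "ok"; output: string }
  | { name: string; status: "cancelled" };

function runToolBatch(
  calls: ToolCall[],
  isCurrent: () => boolean, // closes over the run's captured generation
): ToolResult[] {
  const results: ToolResult[] = [];
  for (const call of calls) {
    if (!isCurrent()) {
      // Gate: the run was invalidated, so skip instead of executing.
      results.push({ name: call.name, status: "cancelled" });
      continue;
    }
    results.push({ name: call.name, status: "ok", output: call.run() });
  }
  return results;
}

// Simulate an interrupt arriving while the first tool is running.
let currentGeneration = 0;
const capturedGeneration = currentGeneration;
const executed: string[] = [];
const batchResults = runToolBatch(
  [
    { name: "read", run: () => { executed.push("read"); currentGeneration += 1; return "done"; } },
    { name: "write", run: () => { executed.push("write"); return "done"; } },
    { name: "exec", run: () => { executed.push("exec"); return "done"; } },
  ],
  () => currentGeneration === capturedGeneration,
);
```

In this sketch the tool that was already running when the interrupt arrived still reports `ok`; only the remaining tools in the batch are skipped. Whether the in-flight tool should also be aborted is a separate decision.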

### 3. Stale output fence
Prevent stale runs from emitting visible effects. Before emitting streaming deltas, typing indicators, progress updates, or final messages, check the generation and drop the output if it is stale. The pattern already exists in `heartbeat-wake.ts` (`handlerGeneration`); apply it to the reply pipeline.
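One way to apply the fence is to wrap the emitter once per run, so individual call sites do not need their own checks. This is a sketch under assumptions: `OutboundEvent` and `fenceEmitter` are hypothetical names, and the actual event shapes in the reply pipeline will differ.

```typescript
// Hypothetical event union; the real reply pipeline has its own types.
type OutboundEvent =
  | { kind: "delta"; text: string }
  | { kind: "typing" }
  | { kind: "progress"; text: string }
  | { kind: "final"; text: string };

/** Wrap `emit` so that events from an invalidated run are dropped. */
function fenceEmitter(
  emit: (event: OutboundEvent) => void,
  isCurrent: () => boolean,
): (event: OutboundEvent) => boolean {
  return (event) => {
    if (!isCurrent()) return false; // stale run: drop silently
    emit(event);
    return true;
  };
}

let liveGeneration = 0;
const sent: OutboundEvent[] = [];
const runGeneration = liveGeneration; // captured when the run starts
const fencedEmit = fenceEmitter(
  (event) => { sent.push(event); },
  () => liveGeneration === runGeneration,
);

const beforeStop = fencedEmit({ kind: "delta", text: "partial answer" }); // delivered
liveGeneration += 1; // /stop or a new message arrives
const afterStop = fencedEmit({ kind: "final", text: "stale answer" }); // dropped
```

Returning a boolean lets callers log or count dropped events without treating a stale emit as an error.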

### 4. Stronger subprocess cancellation
Wire exec/supervisor processes to the session's run scope. When the generation changes, cancel every process that was started under an older generation.
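A rough sketch of tying processes to a generation. `Cancellable` and `RunProcessScope` are hypothetical; a real handle might wrap `managedRun.cancel(...)` or the supervisor's `cancelScope(scopeKey)`, whose exact signatures are not assumed here.

```typescript
// Hypothetical interface standing in for a managed process handle.
interface Cancellable {
  cancel(reason: string): void;
}

class RunProcessScope {
  private readonly byGeneration = new Map<number, Set<Cancellable>>();

  /** Track a process as owned by the run with this generation. */
  register(generation: number, proc: Cancellable): void {
    const procs = this.byGeneration.get(generation) ?? new Set<Cancellable>();
    procs.add(proc);
    this.byGeneration.set(generation, procs);
  }

  /** The process exited on its own; stop tracking it. */
  release(generation: number, proc: Cancellable): void {
    this.byGeneration.get(generation)?.delete(proc);
  }

  /** Cancel everything owned by generations older than `current`. */
  cancelStale(current: number, reason: string): number {
    let cancelled = 0;
    this.byGeneration.forEach((procs, generation) => {
      if (generation >= current) return;
      procs.forEach((proc) => {
        proc.cancel(reason);
        cancelled += 1;
      });
      this.byGeneration.delete(generation);
    });
    return cancelled;
  }
}

const scope = new RunProcessScope();
const cancelLog: string[] = [];
const fakeProc = (name: string): Cancellable => ({
  cancel: (reason) => { cancelLog.push(`${name}:${reason}`); },
});
scope.register(0, fakeProc("build"));
scope.register(0, fakeProc("test"));
scope.register(1, fakeProc("new-run"));
const cancelledCount = scope.cancelStale(1, "superseded"); // cancels build and test
```

Escalation for processes that ignore the first signal is left to the underlying handle in this sketch.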

### 5. New message takeover
When a new user message arrives during an active run: increment generation → cancel active processes → abort embedded run → clear queues → new message becomes next input.
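The ordering above can be sketched as a single function. `TakeoverHooks` and `takeOver` are hypothetical; each hook stands in for one of the existing primitives, and the choice to continue past a failing cleanup step is an assumption of this sketch, not established behavior.

```typescript
// Hypothetical hooks; each stands in for an existing primitive.
interface TakeoverHooks {
  bumpGeneration(sessionKey: string): number;
  cancelProcesses(sessionKey: string): void;
  abortEmbeddedRun(sessionKey: string): void;
  clearQueues(sessionKey: string): void;
  enqueueInput(sessionKey: string, text: string, generation: number): void;
}

function takeOver(hooks: TakeoverHooks, sessionKey: string, text: string): number {
  // Bump first so gates and fences reject the old run before cleanup starts.
  const generation = hooks.bumpGeneration(sessionKey);
  const cleanup: Array<() => void> = [
    () => hooks.cancelProcesses(sessionKey),
    () => hooks.abortEmbeddedRun(sessionKey),
    () => hooks.clearQueues(sessionKey),
  ];
  for (const step of cleanup) {
    try {
      step();
    } catch {
      // Assumption: one failing cleanup step should not block the new message.
    }
  }
  hooks.enqueueInput(sessionKey, text, generation);
  return generation;
}

const order: string[] = [];
let sessionGeneration = 0;
const newGeneration = takeOver(
  {
    bumpGeneration: () => { order.push("bump"); sessionGeneration += 1; return sessionGeneration; },
    cancelProcesses: () => { order.push("cancel"); throw new Error("already exited"); },
    abortEmbeddedRun: () => { order.push("abort"); },
    clearQueues: () => { order.push("clear"); },
    enqueueInput: (_session, input) => { order.push(`enqueue:${input}`); },
  },
  "session-a",
  "do something else",
);
```

Incrementing the generation before any cleanup means that even if cancellation is slow, the old run is already unable to pass a gate or fence.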

## Prior art

Hermes agent demonstrates these patterns with good test coverage:
- Thread-scoped interrupt signaling (`tools/interrupt.py`)
- Pre-tool interrupt checks with test coverage
- Gateway run generation invalidation for stale outputs
- SIGTERM→SIGKILL escalation for resistant processes
- Pending message queue drain and combination

The goal is not a line-by-line port, but adapting these concepts to OpenClaw's async architecture.

## Benefits

- Safer production behavior (tool chains stop reliably)
- Stronger user control and trust
- More predictable `/stop` semantics
- Fewer stale messages after abort
- Foundation for safer autonomous operation

## I'm willing to contribute a PR

I have a prototype implementation plan and would be happy to contribute a PR if the maintainers are interested. The implementation is designed to layer on top of existing primitives without breaking current behavior.

This would be AI-assisted (Claude Code) with testing.

