Stuck-session recovery aborts long-but-active agent runs at warnMs×3 (~6min) with misleading "Reply operation aborted by user" reason

## Summary

On `2026.5.27`, stuck-session diagnostic recovery (`recoverStuckDiagnosticSession`) aborts **legitimately long, actively-working** agent runs (e.g. a `claude-cli` backend agent doing a deep review with `thinking: max`) once the run age reaches the default abort threshold: `stuckSessionWarnMs` (2 min) × `STALLED_EMBEDDED_RUN_ABORT_WARN_MULTIPLIER` (3) = **6 min**. The abort surfaces as `AbortError: "Reply operation aborted by user"`, which is misleading — no user action occurred.

Net effect: any agent turn that legitimately takes >6 min is killed before producing its final reply (channel records `no_final` / no delivery), so on a Feishu channel (and presumably others) the agent simply "never replies".

## Environment

- OpenClaw `2026.5.27`, macOS, launchd gateway
- Agent backend: `claude-cli` runtime (`anthropic/claude-opus-4-8`), `thinkingDefault: max`, doing long content-review turns (reads several files + greps + extended thinking)
- No `diagnostics` block configured (defaults)

## Symptom

- Short turns (<~5 min) reply normally; long turns get cut at ~6 min with no final reply.
- Gateway log shows only:
  ```
  [agent/cli-backend] claude live session close: provider=claude-cli model=... reason=abort
  feishu[...]: dispatch complete (queuedFinal=false, replies=0)
  ```
  No timeout text ("produced no output" / "exceeded timeout"), no channel error — very hard to diagnose from logs alone.

## Root cause (captured)

Tracing `AbortController.prototype.abort` in the gateway process, the abort at ~6 min into the run was:

```
reason = AbortError: "Reply operation aborted by user"
  at abortInternally               (reply-run-registry-*.js)
  at abortWithReason               (reply-run-registry)
  at abortByUser                   (reply-run-registry)
  at abortReplyRunBySessionId      (reply-run-registry)
  at abortEmbeddedPiRun            (runs-*.js)
  at abortAndDrainEmbeddedPiRun    (runs)
  at recoverStuckDiagnosticSession (diagnostic-stuck-session-recovery.runtime-*.js)
  at                               (diagnostic-*.js)
```

Threshold math (`diagnostic-*.js`): with no config, `resolveStalledEmbeddedRunAbortMs(120000)` = `max(MIN_STALLED_EMBEDDED_RUN_ABORT_MS 5min, 120000 × 3)` = `360000` = exactly 6 min, matching the observed kill time (+ `STUCK_SESSION_ABORT_SETTLE_MS` 15s settle).

## Two distinct problems

1. **Active work misclassified as stalled.** A single long `model_call` / `embedded_run` still actively producing tokens/tool calls (just without frequent "visible progress" deltas) is treated as stuck and aborted. In our capture, the CLI had emitted output only ~18 s before the abort (largest silent gap in the entire run was ~145 s) — clearly not hung. Stuck recovery should not abort a run whose underlying CLI/model is still actively streaming output.

2. **Misleading abort reason.** Recovery aborts via `abortByUser()` → `AbortError: "Reply operation aborted by user"`. This is automatic recovery, not a user action; the reason made root-causing very hard (looks like a user cancel, not a timeout/recovery). Please surface an accurate reason such as `stuck_session_recovery`.

## Regression

Did not happen on `2026.5.5` (its recovery runtime is ~86 lines, no `abortEmbeddedPiRun` / multiplier / `stuckSessionAbortMs`). The abort-of-active-runs path was added/expanded in `2026.5.12` (added `diagnostics.stuckSessionAbortMs`, outcome-driven recovery) and `2026.5.17` (abort active embedded runs on stale native tool call). Upgrading 5.5 → 5.27 introduced it.

## Workaround

```json
"diagnostics": { "stuckSessionAbortMs": 1800000 }
```
(Adding the top-level `diagnostics` object is not hot-reloadable — it triggers a full gateway restart to apply.)

## Suggested fixes

- Don't abort a run as "stuck" while its CLI/model is still actively producing output (gate on output liveness, not just age / coarse progress).
- Use a truthful abort reason (e.g. `stuck_session_recovery`) instead of `"Reply operation aborted by user"`.
- Document `diagnostics.stuckSessionWarnMs` / `stuckSessionAbortMs` and the default derivation (`warnMs × 3`, min 5 min) so the effective ~6 min cap on long agent turns is discoverable.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Stuck-session recovery aborts long-but-active agent runs at warnMs×3 (~6min) with misleading "Reply operation aborted by user" reason #88870

Summary

Environment

Symptom

Root cause (captured)

Two distinct problems

Regression

Workaround

Suggested fixes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Stuck-session recovery aborts long-but-active agent runs at warnMs×3 (~6min) with misleading "Reply operation aborted by user" reason #88870

Description

Summary

Environment

Symptom

Root cause (captured)

Two distinct problems

Regression

Workaround

Suggested fixes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions