Stuck-session recovery aborts long-but-active agent runs at warnMs×3 (~6min) with misleading "Reply operation aborted by user" reason #88870

@tehao999

Summary

On 2026.5.27, stuck-session diagnostic recovery (recoverStuckDiagnosticSession) aborts legitimately long, actively-working agent runs (e.g. a claude-cli backend agent doing a deep review with thinking: max) once the run age reaches the default abort threshold: stuckSessionWarnMs (2 min) × STALLED_EMBEDDED_RUN_ABORT_WARN_MULTIPLIER (3) = 6 min. The abort surfaces as AbortError: "Reply operation aborted by user", which is misleading — no user action occurred.

Net effect: any agent turn that legitimately takes >6 min is killed before producing its final reply (channel records no_final / no delivery), so on a Feishu channel (and presumably others) the agent simply "never replies".

Environment

  • OpenClaw 2026.5.27, macOS, launchd gateway
  • Agent backend: claude-cli runtime (anthropic/claude-opus-4-8), thinkingDefault: max, doing long content-review turns (reads several files + greps + extended thinking)
  • No diagnostics block configured (defaults)

Symptom

  • Short turns (<~5 min) reply normally; long turns get cut at ~6 min with no final reply.
  • Gateway log shows only:
    [agent/cli-backend] claude live session close: provider=claude-cli model=... reason=abort
    feishu[...]: dispatch complete (queuedFinal=false, replies=0)
    
    No timeout text ("produced no output" / "exceeded timeout"), no channel error — very hard to diagnose from logs alone.

Root cause (captured)

Tracing AbortController.prototype.abort in the gateway process, the abort at ~6 min into the run was:

reason = AbortError: "Reply operation aborted by user"
  at abortInternally               (reply-run-registry-*.js)
  at abortWithReason               (reply-run-registry)
  at abortByUser                   (reply-run-registry)
  at abortReplyRunBySessionId      (reply-run-registry)
  at abortEmbeddedPiRun            (runs-*.js)
  at abortAndDrainEmbeddedPiRun    (runs)
  at recoverStuckDiagnosticSession (diagnostic-stuck-session-recovery.runtime-*.js)
  at                               (diagnostic-*.js)
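
A trace like the one above can be captured by temporarily wrapping AbortController.prototype.abort so that each call records its reason and call stack. The following is a minimal debug-only sketch, not OpenClaw code: installAbortTrace and AbortTraceEntry are hypothetical names introduced for illustration, and only the standard AbortController API is assumed.

```typescript
// Debug-only sketch: record who calls AbortController.prototype.abort.
// installAbortTrace and AbortTraceEntry are hypothetical names, not OpenClaw APIs.
interface AbortTraceEntry {
  reason: unknown;
  stack: string;
}

function installAbortTrace(sink: AbortTraceEntry[]): () => void {
  const original = AbortController.prototype.abort;

  AbortController.prototype.abort = function (
    this: AbortController,
    reason?: unknown,
  ): void {
    // Capture the caller's stack before delegating to the real implementation.
    sink.push({ reason, stack: new Error("abort trace").stack ?? "" });
    return original.call(this, reason);
  };

  // Return an uninstall function so the patch can be removed after debugging.
  return () => {
    AbortController.prototype.abort = original;
  };
}
```

Each recorded entry holds the abort reason plus the stack at the call site, which is how a frame such as recoverStuckDiagnosticSession becomes visible. The patch is process-wide, so it is best removed once the capture is done.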

Threshold math (diagnostic-*.js): with no config, resolveStalledEmbeddedRunAbortMs(120000) = max(MIN_STALLED_EMBEDDED_RUN_ABORT_MS, 120000 × STALLED_EMBEDDED_RUN_ABORT_WARN_MULTIPLIER) = max(300000, 360000) = 360000 ms, i.e. exactly 6 min. This matches the observed kill time (plus the 15 s STUCK_SESSION_ABORT_SETTLE_MS settle).
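
The derivation can be written out as a small function. This is a reconstruction from the values reported in this issue, not a copy of the gateway source: the constant and function names are the ones quoted above, but the body is an assumption based on the described max(min, warnMs × 3) behavior.

```typescript
// Reconstruction of the default abort threshold as described in this issue.
// The body is assumed from the reported behavior, not taken from the source.
const STALLED_EMBEDDED_RUN_ABORT_WARN_MULTIPLIER = 3;
const MIN_STALLED_EMBEDDED_RUN_ABORT_MS = 300000; // 5 min

function resolveStalledEmbeddedRunAbortMs(stuckSessionWarnMs: number): number {
  return Math.max(
    MIN_STALLED_EMBEDDED_RUN_ABORT_MS,
    stuckSessionWarnMs * STALLED_EMBEDDED_RUN_ABORT_WARN_MULTIPLIER,
  );
}

// Default stuckSessionWarnMs of 2 min gives a 6 min abort threshold.
const defaultAbortMs = resolveStalledEmbeddedRunAbortMs(120000);
```

With the default warn threshold the multiplier term wins (360000 > 300000). The 5 min floor only matters when stuckSessionWarnMs is set below 100000 ms.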

Two distinct problems

  1. Active work misclassified as stalled. A single long model_call / embedded_run still actively producing tokens/tool calls (just without frequent "visible progress" deltas) is treated as stuck and aborted. In our capture, the CLI had emitted output only ~18 s before the abort (largest silent gap in the entire run was ~145 s) — clearly not hung. Stuck recovery should not abort a run whose underlying CLI/model is still actively streaming output.

  2. Misleading abort reason. Recovery aborts via abortByUser() → AbortError: "Reply operation aborted by user". This is automatic recovery, not a user action, and the reason made root-causing very hard because it looks like a user cancel rather than a timeout or recovery. Please surface an accurate reason such as stuck_session_recovery.
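
For the second problem, one possible shape is to carry the cause on the abort reason itself, so logs and callers can tell a user cancel from automatic recovery. This is only a sketch under stated assumptions: ReplyAbortError, AbortCause and abortWithCause are hypothetical names, the recovery message text is invented for illustration, and only the standard AbortController.abort(reason) API is relied on.

```typescript
// Hypothetical sketch: attach the real cause to the abort reason.
// None of these names exist in OpenClaw; they are illustrative only.
type AbortCause = "user" | "stuck_session_recovery";

class ReplyAbortError extends Error {
  readonly abortCause: AbortCause;

  constructor(abortCause: AbortCause, message: string) {
    super(message);
    this.name = "AbortError";
    this.abortCause = abortCause;
  }
}

function abortWithCause(controller: AbortController, abortCause: AbortCause): void {
  const message =
    abortCause === "user"
      ? "Reply operation aborted by user"
      : "Reply operation aborted by stuck-session recovery";
  // AbortController.abort accepts an arbitrary reason, exposed as signal.reason.
  controller.abort(new ReplyAbortError(abortCause, message));
}
```

A recovery path would then call abortWithCause(controller, "stuck_session_recovery"), and a log line built from signal.reason would no longer look like a user cancel.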

Regression

Did not happen on 2026.5.5 (its recovery runtime is ~86 lines, no abortEmbeddedPiRun / multiplier / stuckSessionAbortMs). The abort-of-active-runs path was added/expanded in 2026.5.12 (added diagnostics.stuckSessionAbortMs, outcome-driven recovery) and 2026.5.17 (abort active embedded runs on stale native tool call). Upgrading 5.5 → 5.27 introduced it.

Workaround

"diagnostics": { "stuckSessionAbortMs": 1800000 }

(1800000 ms = 30 min. Adding the top-level diagnostics object is not hot-reloadable; applying it triggers a full gateway restart.)

Suggested fixes

  • Don't abort a run as "stuck" while its CLI/model is still actively producing output (gate on output liveness, not just age / coarse progress).
  • Use a truthful abort reason (e.g. stuck_session_recovery) instead of "Reply operation aborted by user".
  • Document diagnostics.stuckSessionWarnMs / stuckSessionAbortMs and the default derivation (warnMs × 3, min 5 min) so the effective ~6 min cap on long agent turns is discoverable.
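
The first suggestion (gate on output liveness, not just age) could look roughly like the check below. It is a sketch of the idea, not a proposed patch: RunSnapshot, StuckPolicy, shouldAbortAsStuck and the outputLivenessMs setting are all hypothetical, and the 180000 ms liveness window is an arbitrary value chosen only to sit above the ~145 s largest silent gap reported above.

```typescript
// Hypothetical liveness gate: a run is only "stuck" if it is both old
// and has been silent for longer than a liveness window.
interface RunSnapshot {
  startedAtMs: number;
  lastOutputAtMs: number; // last time the CLI/model emitted any output
}

interface StuckPolicy {
  abortMs: number; // age threshold, e.g. 360000 by default
  outputLivenessMs: number; // assumed new setting, not an existing option
}

function shouldAbortAsStuck(run: RunSnapshot, policy: StuckPolicy, nowMs: number): boolean {
  const runAgeMs = nowMs - run.startedAtMs;
  const silentMs = nowMs - run.lastOutputAtMs;
  return runAgeMs >= policy.abortMs && silentMs >= policy.outputLivenessMs;
}

// Values from the capture in this issue: 6 min old, last output ~18 s ago.
const policy: StuckPolicy = { abortMs: 360000, outputLivenessMs: 180000 };
const captured: RunSnapshot = { startedAtMs: 0, lastOutputAtMs: 360000 - 18000 };
const wouldAbort = shouldAbortAsStuck(captured, policy, 360000); // false
```

Under this check the captured run would have survived the 6 min mark, while a run that had been silent for the full liveness window would still be recovered.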

Metadata

    Labels

      • P2: Normal backlog priority with limited blast radius.
      • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
      • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
      • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
      • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
