Stuck-session recovery aborts long-but-active agent runs at warnMs×3 (~6min) with misleading "Reply operation aborted by user" reason #88870
Open
Labels
- P2 (Normal backlog priority with limited blast radius.)
- clawsweeper:needs-live-repro (ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.)
- impact:message-loss (Channel message delivery can be lost, duplicated, or misrouted.)
- impact:session-state (Session, memory, transcript, context, or agent state can drift or corrupt.)
- issue-rating: 🐚 platinum hermit (Good issue quality with a plausible reproduction path needing some confirmation.)
Summary
On 2026.5.27, stuck-session diagnostic recovery (recoverStuckDiagnosticSession) aborts legitimately long, actively working agent runs (e.g. a claude-cli backend agent doing a deep review with thinking: max) once the run age reaches the default abort threshold: stuckSessionWarnMs (2 min) × STALLED_EMBEDDED_RUN_ABORT_WARN_MULTIPLIER (3) = 6 min. The abort surfaces as AbortError: "Reply operation aborted by user", which is misleading because no user action occurred.
Net effect: any agent turn that legitimately takes >6 min is killed before producing its final reply (channel records no_final / no delivery), so on a Feishu channel (and presumably others) the agent simply "never replies".
Environment
- 2026.5.27, macOS, launchd gateway
- claude-cli runtime (anthropic/claude-opus-4-8), thinkingDefault: max, doing long content-review turns (reads several files + greps + extended thinking)
- No diagnostics block configured (defaults)
Symptom
Root cause (captured)
Tracing AbortController.prototype.abort in the gateway process, the abort at ~6 min into the run was:
Threshold math (diagnostic-*.js): with no config, resolveStalledEmbeddedRunAbortMs(120000) = max(MIN_STALLED_EMBEDDED_RUN_ABORT_MS = 5 min, 120000 × 3) = 360000 = exactly 6 min, matching the observed kill time (+ STUCK_SESSION_ABORT_SETTLE_MS 15 s settle).
Two distinct problems
1. Active work misclassified as stalled. A single long model_call / embedded_run still actively producing tokens/tool calls (just without frequent "visible progress" deltas) is treated as stuck and aborted. In our capture, the CLI had emitted output only ~18 s before the abort (largest silent gap in the entire run was ~145 s), so it was clearly not hung. Stuck recovery should not abort a run whose underlying CLI/model is still actively streaming output.
2. Misleading abort reason. Recovery aborts via abortByUser() → AbortError: "Reply operation aborted by user". This is automatic recovery, not a user action; the reason made root-causing very hard (it looks like a user cancel, not a timeout/recovery). Please surface an accurate reason such as stuck_session_recovery.
Regression
Did not happen on 2026.5.5 (its recovery runtime is ~86 lines, with no abortEmbeddedPiRun / multiplier / stuckSessionAbortMs). The abort-of-active-runs path was added/expanded in 2026.5.12 (added diagnostics.stuckSessionAbortMs, outcome-driven recovery) and 2026.5.17 (abort active embedded runs on stale native tool call). Upgrading 5.5 → 5.27 introduced it.
Workaround
(Adding the top-level diagnostics object is not hot-reloadable; it triggers a full gateway restart to apply.)
Suggested fixes
- Surface an accurate abort reason (e.g. stuck_session_recovery) instead of "Reply operation aborted by user".
- Document diagnostics.stuckSessionWarnMs / stuckSessionAbortMs and the default derivation (warnMs × 3, min 5 min) so the effective ~6 min cap on long agent turns is discoverable.
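To make the derivation and the requested behaviour concrete, here is a minimal TypeScript sketch. It reproduces the default threshold math reported in this issue (warnMs × 3, min 5 min). It also shows one possible way recovery could skip runs with recent output and abort with a reason that names recovery rather than the user. The two constant names, resolveStalledEmbeddedRunAbortMs, and the numeric values are taken from this report. RunSnapshot, shouldAbortStalledRun, quietMs, StuckSessionRecoveryError, and abortForRecovery are hypothetical names for illustration only, not the gateway's actual API.

```typescript
// Values as reported in this issue for diagnostic-*.js (not copied from the source).
const MIN_STALLED_EMBEDDED_RUN_ABORT_MS = 5 * 60_000;
const STALLED_EMBEDDED_RUN_ABORT_WARN_MULTIPLIER = 3;

// Default derivation as described in the report: max(5 min floor, warnMs x 3).
function resolveStalledEmbeddedRunAbortMs(warnMs: number): number {
  return Math.max(
    MIN_STALLED_EMBEDDED_RUN_ABORT_MS,
    warnMs * STALLED_EMBEDDED_RUN_ABORT_WARN_MULTIPLIER,
  );
}

// Hypothetical snapshot of a run; the real gateway may track activity differently.
interface RunSnapshot {
  startedAtMs: number;    // when the run began
  lastOutputAtMs: number; // last time the CLI/model emitted any output
}

// Hypothetical check: abort only if the run is old AND has been silent for quietMs.
function shouldAbortStalledRun(
  run: RunSnapshot,
  nowMs: number,
  abortMs: number,
  quietMs: number,
): boolean {
  const ageMs = nowMs - run.startedAtMs;
  const silentMs = nowMs - run.lastOutputAtMs;
  return ageMs >= abortMs && silentMs >= quietMs;
}

// Hypothetical error type carrying an accurate, machine-readable reason.
class StuckSessionRecoveryError extends Error {
  readonly reason = "stuck_session_recovery";
  constructor(ageMs: number) {
    super(`Run aborted by stuck-session recovery after ${Math.round(ageMs / 1000)}s`);
    this.name = "AbortError";
  }
}

// Hypothetical replacement for the abortByUser() call on the recovery path.
function abortForRecovery(controller: AbortController, ageMs: number): void {
  controller.abort(new StuckSessionRecoveryError(ageMs));
}
```

With the figures from this report (run age 360 s, last output ~18 s before the abort), shouldAbortStalledRun returns false for any quiet window longer than 18 s, so the run would be left alone. The quiet window is an arbitrary placeholder in this sketch; choosing its value is a design decision for the maintainers.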