fix(agents): surface user-visible error when embedded session is stuck or overflows context (#84536)#84602
fix(agents): surface user-visible error when embedded session is stuck or overflows context (#84536)#84602lml2468 wants to merge 1 commit into
Conversation
…k or overflows context Fixes openclaw#84536. Two root causes addressed: 1. Dead-code guard in agent-runner-execution.ts The hasPayloadText guard on the context overflow fallback made the branch unreachable in the common terminal-overflow path, because run.ts always includes an error payload when it reaches the terminal overflow return. This silently fell through to the success path where the payload might still get delivered, but the fallback never fired for aborted sessions. Fix: remove the guard so the friendly overflow message is always surfaced when meta.error is a context overflow error. 2. Stuck-session recovery produces no user notification recoverStuckDiagnosticSession aborts the embedded run and releases the lane, but the abort result had empty payloads so the user saw nothing. Fix: thread the abort reason (stuck_recovery) from abortAndDrainEmbeddedPiRun through handle.abort(reason) to AbortController.abort(reason), expose it as EmbeddedRunAttemptResult.abortReason, and in run.ts synthesize a user-visible error payload when abortReason is stuck_recovery and no other payload was generated.
|
Codex review: needs real behavior proof before merge. Latest ClawSweeper review: 2026-05-23 16:15 UTC / May 23, 2026, 12:15 PM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
PR Surface View PR surface stats
Summary Reproducibility: yes. for the PR defect by source inspection: PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Mantis proof suggestion Risk before merge
Maintainer options:
Next step before merge Security Review findings
Review detailsBest possible solution: Keep the user-visible recovery direction, but wrap the embedded queue handle abort so timeout state and abort reason are separate, add focused reason-to-payload coverage, and prove one real channel notification. Do we have a high-confidence way to reproduce the issue? Yes for the PR defect by source inspection: Is this the best way to solve the issue? No as written; the product direction is reasonable, but the implementation must preserve the recovery reason without toggling timeout state, then prove the notification in a real channel path. Label justifications:
Full review comments:
Overall correctness: patch is incorrect What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against a04566da11ce. |
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
|
@clawsweeper re-review Fixed the P1 abort reason contract bug: Also added a regression test in |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
Fixes #84536.
Problem
Preemptive context overflow checks (
shouldPreemptivelyCompactBeforePrompt) fire during tool loops in embedded agent sessions. The session ends withembedded_run_agent_end+ error, but no error message is delivered to the channel. Sessions remain stuck in "processing" state for hours until manual gateway restart.Root Causes
Bug 1: Dead-code overflow fallback in
agent-runner-execution.tsThe
!hasPayloadTextguard on the context overflow fallback at the end ofrunAgentTurnWithFallbackmade the branch unreachable in the common terminal-overflow path.run.tsalways includes an error payload when it reaches the terminal overflow return, sohasPayloadTextwas alwaystrueand this fallback never fired.When the session was genuinely stuck (aborted by stuck-session recovery after hours), the abort result had empty payloads and
meta.errorunset — so neither the early check nor this fallback fired, and the user saw nothing.Fix: Remove the
!hasPayloadTextguard. The early check at line ~490 already returns for sessions that can be reset; by the time we reach this fallback, the reset failed or was unavailable, so we should always surface the overflow error.Bug 2: Stuck-session recovery produces no user notification
recoverStuckDiagnosticSession(triggered afterstuckSessionWarnMsof a session in processing state) callsabortAndDrainEmbeddedPiRunand releases the lane — but the aborted run returns with empty payloads and no error inmeta, so the user receives no message.Fix: Thread the abort reason (
"stuck_recovery") through:abortAndDrainEmbeddedPiRun→handle.abort(reason)(when reason is provided)handle.abort(reason)→runAbortController.abort(reason)(abort signal now carries reason)runAbortController.signal.reason→EmbeddedRunAttemptResult.abortReasonattempt.abortReason === "stuck_recovery"inrun.ts→ synthesize a user-visible error payloadThe synthesized payload delivers: "⚠️ Your session was stuck and has been automatically recovered. Please try again."
Changes
src/auto-reply/reply/agent-runner-execution.ts!hasPayloadTextdead guard; always surface overflow errorsrc/agents/pi-embedded-runner/run-state.tsEmbeddedPiQueueHandle.abort: (reason?: unknown) => voidsrc/agents/pi-embedded-runner/runs.tsparams.reasontohandle.abort(reason)inabortAndDrainEmbeddedPiRunsrc/agents/pi-embedded-runner/run/types.tsabortReason?: stringtoEmbeddedRunAttemptResultsrc/agents/pi-embedded-runner/run/attempt.tsrunAbortController.signal.reasonand expose asabortReasonsrc/agents/pi-embedded-runner/run.tsabortReason === "stuck_recovery"Design Notes
AbortController.abort(reason)/signal.reasonpattern already present in the codebase (seesessions_yieldabort)stuckRecoveryPayloadis only synthesized whenpayloadsForTerminalPathis empty — normal completions and overflow terminal paths keep their existing payloadslivenessStateis set to"blocked"for stuck-recovery aborts to match the existing convention for error-terminal pathsAcceptance criteria
Fix commit (f4e723e)
Root cause fix:
queueHandle.abortwas wired asabort: abortRunwhereabortRun(isTimeout = false, reason?)takes the timeout flag as first arg. Callinghandle.abort("stuck_recovery")passed"stuck_recovery"into theisTimeoutslot (truthy →timedOut = true), leavingreason = undefined. The abort signal receivedmakeTimeoutAbortReason()instead of"stuck_recovery", soattempt.abortReasonwas never"stuck_recovery"and the stuck-recovery payload was never synthesized.Fix:
src/agents/pi-embedded-runner/run/attempt.ts:3314This matches the updated
EmbeddedPiQueueHandletype (abort: (reason?: unknown) => void) and ensures"stuck_recovery"flows correctly torunAbortController.abort(reason)→signal.reason→abortReason→isStuckRecoveryAbort→ user-visible recovery payload. ✓Tests added:
src/agents/pi-embedded-runner/runs.test.ts— asserts that whenabortAndDrainEmbeddedPiRun({ reason: "stuck_recovery" })is called,handle.abortreceives"stuck_recovery"as the reason (not isTimeout).All three acceptance criteria test commands pass.