Skip to content

fix(agents): surface user-visible error when embedded session is stuck or overflows context (#84536)#84602

Open
lml2468 wants to merge 1 commit into
openclaw:mainfrom
lml2468:fix/issue-84536
Open

fix(agents): surface user-visible error when embedded session is stuck or overflows context (#84536)#84602
lml2468 wants to merge 1 commit into
openclaw:mainfrom
lml2468:fix/issue-84536

Conversation

@lml2468

@lml2468 lml2468 commented May 20, 2026

Copy link
Copy Markdown
Contributor

Fixes #84536.

Problem

Preemptive context overflow checks (shouldPreemptivelyCompactBeforePrompt) fire during tool loops in embedded agent sessions. The session ends with embedded_run_agent_end + error, but no error message is delivered to the channel. Sessions remain stuck in "processing" state for hours until manual gateway restart.

Root Causes

Bug 1: Dead-code overflow fallback in agent-runner-execution.ts

The !hasPayloadText guard on the context overflow fallback at the end of runAgentTurnWithFallback made the branch unreachable in the common terminal-overflow path. run.ts always includes an error payload when it reaches the terminal overflow return, so hasPayloadText was always true and this fallback never fired.

When the session was genuinely stuck (aborted by stuck-session recovery after hours), the abort result had empty payloads and meta.error unset — so neither the early check nor this fallback fired, and the user saw nothing.

Fix: Remove the !hasPayloadText guard. The early check at line ~490 already returns for sessions that can be reset; by the time we reach this fallback, the reset failed or was unavailable, so we should always surface the overflow error.

Bug 2: Stuck-session recovery produces no user notification

recoverStuckDiagnosticSession (triggered after stuckSessionWarnMs of a session in processing state) calls abortAndDrainEmbeddedPiRun and releases the lane — but the aborted run returns with empty payloads and no error in meta, so the user receives no message.

Fix: Thread the abort reason ("stuck_recovery") through:

  1. abortAndDrainEmbeddedPiRunhandle.abort(reason) (when reason is provided)
  2. handle.abort(reason)runAbortController.abort(reason) (abort signal now carries reason)
  3. runAbortController.signal.reasonEmbeddedRunAttemptResult.abortReason
  4. attempt.abortReason === "stuck_recovery" in run.ts → synthesize a user-visible error payload

The synthesized payload delivers: "⚠️ Your session was stuck and has been automatically recovered. Please try again."

Changes

File Change
src/auto-reply/reply/agent-runner-execution.ts Remove !hasPayloadText dead guard; always surface overflow error
src/agents/pi-embedded-runner/run-state.ts EmbeddedPiQueueHandle.abort: (reason?: unknown) => void
src/agents/pi-embedded-runner/runs.ts Pass params.reason to handle.abort(reason) in abortAndDrainEmbeddedPiRun
src/agents/pi-embedded-runner/run/types.ts Add abortReason?: string to EmbeddedRunAttemptResult
src/agents/pi-embedded-runner/run/attempt.ts Read runAbortController.signal.reason and expose as abortReason
src/agents/pi-embedded-runner/run.ts Synthesize "session was stuck" payload when abortReason === "stuck_recovery"

Design Notes

  • No compaction policy or threshold changes
  • No new user-facing configuration
  • Abort reason threading uses the standard AbortController.abort(reason) / signal.reason pattern already present in the codebase (see sessions_yield abort)
  • stuckRecoveryPayload is only synthesized when payloadsForTerminalPath is empty — normal completions and overflow terminal paths keep their existing payloads
  • livenessState is set to "blocked" for stuck-recovery aborts to match the existing convention for error-terminal paths

Acceptance criteria

node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run.overflow-compaction.loop.test.ts
node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.spawn-workspace.context-engine.test.ts
node scripts/run-vitest.mjs src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts

Fix commit (f4e723e)

Root cause fix: queueHandle.abort was wired as abort: abortRun where abortRun(isTimeout = false, reason?) takes the timeout flag as first arg. Calling handle.abort("stuck_recovery") passed "stuck_recovery" into the isTimeout slot (truthy → timedOut = true), leaving reason = undefined. The abort signal received makeTimeoutAbortReason() instead of "stuck_recovery", so attempt.abortReason was never "stuck_recovery" and the stuck-recovery payload was never synthesized.

Fix: src/agents/pi-embedded-runner/run/attempt.ts:3314

-  abort: abortRun,
+  abort: (reason?: unknown) => abortRun(false, reason),

This matches the updated EmbeddedPiQueueHandle type (abort: (reason?: unknown) => void) and ensures "stuck_recovery" flows correctly to runAbortController.abort(reason)signal.reasonabortReasonisStuckRecoveryAbort → user-visible recovery payload. ✓

Tests added: src/agents/pi-embedded-runner/runs.test.ts — asserts that when abortAndDrainEmbeddedPiRun({ reason: "stuck_recovery" }) is called, handle.abort receives "stuck_recovery" as the reason (not isTimeout).

All three acceptance criteria test commands pass.

…k or overflows context

Fixes openclaw#84536.

Two root causes addressed:

1. Dead-code guard in agent-runner-execution.ts
   The hasPayloadText guard on the context overflow fallback made the
   branch unreachable in the common terminal-overflow path, because
   run.ts always includes an error payload when it reaches the terminal
   overflow return. This silently fell through to the success path where
   the payload might still get delivered, but the fallback never fired
   for aborted sessions.
   Fix: remove the guard so the friendly overflow message is always
   surfaced when meta.error is a context overflow error.

2. Stuck-session recovery produces no user notification
   recoverStuckDiagnosticSession aborts the embedded run and releases
   the lane, but the abort result had empty payloads so the user saw
   nothing.

   Fix: thread the abort reason (stuck_recovery) from
   abortAndDrainEmbeddedPiRun through handle.abort(reason) to
   AbortController.abort(reason), expose it as
   EmbeddedRunAttemptResult.abortReason, and in run.ts synthesize a
   user-visible error payload when abortReason is stuck_recovery and
   no other payload was generated.
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 20, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-23 16:15 UTC / May 23, 2026, 12:15 PM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

PR Surface
Source +56. Total +56 across 6 files.

View PR surface stats
Area Files Added Removed Net
Source 6 63 7 +56
Tests 0 0 0 0
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 6 63 7 +56

Summary
Threads an abort reason through embedded Pi run abort/drain handling, adds stuck-recovery payload synthesis, and removes the context-overflow payload-text guard.

Reproducibility: yes. for the PR defect by source inspection: abortAndDrainEmbeddedPiRun({ reason: "stuck_recovery" }) calls handle.abort(params.reason), while the embedded handle still maps abort to abortRun(isTimeout, reason?). I did not run tests because this review is read-only.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Summary: The PR targets an important bug, but the current head still has a blocking abort-reason defect and lacks real behavior proof.

Rank-up moves:

  • Fix the abort handle contract so stuck_recovery reaches AbortSignal.reason without setting timeout state.
  • Add focused coverage for reason propagation and the synthesized stuck-recovery payload.
  • Add redacted real behavior proof showing the channel receives the recovery or overflow error.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: No after-fix real setup proof is present; add redacted terminal/log output, a channel transcript, or a recording showing the recovery or overflow error, then update the PR body or ask for @clawsweeper re-review.

Mantis proof suggestion
A live Telegram transcript can directly show whether the terminal overflow or stuck-recovery abort posts one visible retry/reset error instead of going silent. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

telegram live: verify that a terminal embedded context-overflow or stuck-recovery abort posts one visible retry/reset error to the chat.

Risk before merge

  • Merging as-is can still leave stuck-session recovery silent because stuck_recovery is passed into the timeout-flag parameter and never reaches attempt.abortReason.
  • The visible channel notification remains unproven in a real setup; the PR body lists commands and claims, but no redacted output, channel transcript, recording, or logs.
  • The PR discussion says the P1 wrapper/test fix was added, but the actual head facafffb0b475eae56cefafc1ddaf33cb36fe46c does not contain that described change.

Maintainer options:

  1. Fix abort reason propagation (recommended)
    Change the active embedded queue handle so handle.abort(reason) calls abortRun(false, reason) or an equivalent reason-first wrapper, then cover the reason-to-payload path.
  2. Require channel-visible proof
    After the code fix, require redacted terminal output, logs, a recording, or a visible channel transcript showing the recovery or overflow error is delivered after this patch.
  3. Pause if the head remains inconsistent
    If the claimed follow-up commit cannot be pushed to this branch, pause or replace the PR rather than merging a head that still contains the known blocker.

Next step before merge
Needs contributor or maintainer action on the PR head plus real behavior proof; ClawSweeper should not repair-gate this external PR while the proof gate is missing.

Security
Cleared: The diff is limited to embedded-run abort/error handling and does not introduce a concrete security or supply-chain concern.

Review findings

  • [P1] Preserve the recovery reason when aborting — src/agents/pi-embedded-runner/runs.ts:521
Review details

Best possible solution:

Keep the user-visible recovery direction, but wrap the embedded queue handle abort so timeout state and abort reason are separate, add focused reason-to-payload coverage, and prove one real channel notification.

Do we have a high-confidence way to reproduce the issue?

Yes for the PR defect by source inspection: abortAndDrainEmbeddedPiRun({ reason: "stuck_recovery" }) calls handle.abort(params.reason), while the embedded handle still maps abort to abortRun(isTimeout, reason?). I did not run tests because this review is read-only.

Is this the best way to solve the issue?

No as written; the product direction is reasonable, but the implementation must preserve the recovery reason without toggling timeout state, then prove the notification in a real channel path.

Label justifications:

  • P1: The PR targets a silent embedded-session failure that can leave real channel workflows stuck until manual recovery.
  • merge-risk: 🚨 session-state: The current patch changes terminal recovery/liveness handling, but the abort reason bug can leave session state stuck or misclassified.
  • merge-risk: 🚨 message-delivery: The intended user-visible recovery message is still not delivered when the abort reason is lost.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🧂 unranked krab, and The PR targets an important bug, but the current head still has a blocking abort-reason defect and lacks real behavior proof.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: No after-fix real setup proof is present; add redacted terminal/log output, a channel transcript, or a recording showing the recovery or overflow error, then update the PR body or ask for @clawsweeper re-review.
  • mantis: telegram-visible-proof: Mantis should capture Telegram visible proof. The change is user-visible channel behavior, and the linked issue includes Telegram sessions where a short live transcript can show the recovery or overflow error instead of silence.

Full review comments:

  • [P1] Preserve the recovery reason when aborting — src/agents/pi-embedded-runner/runs.ts:521
    This passes params.reason into handle.abort, but the active embedded handle still points abort at abortRun(isTimeout = false, reason?). Passing "stuck_recovery" therefore sets the timeout flag and leaves AbortSignal.reason as a timeout object, so attempt.abortReason never equals "stuck_recovery" and the synthesized recovery payload will not fire. Wrap the handle as abort: (reason?: unknown) => abortRun(false, reason) or equivalent.
    Confidence: 0.96

Overall correctness: patch is incorrect
Overall confidence: 0.93

What I checked:

  • Abort reason is passed through the changed path: The PR head changes abortAndDrainEmbeddedPiRun to call handle.abort(params.reason) when a reason is present; this is the changed call that is supposed to carry stuck_recovery downstream. (src/agents/pi-embedded-runner/runs.ts:521, facafffb0b47)
  • Active embedded handle still maps abort to timeout-first function: The same PR head still assigns abort: abortRun, while abortRun is defined as (isTimeout = false, reason?: unknown), so handle.abort("stuck_recovery") lands in the timeout flag slot rather than the reason slot. (src/agents/pi-embedded-runner/run/attempt.ts:3314, facafffb0b47)
  • Synthesized recovery message depends on the lost string reason: The PR only emits the new stuck-session payload when attempt.abortReason === "stuck_recovery", so the timeout-first abort mapping prevents this terminal payload from being produced. (src/agents/pi-embedded-runner/run.ts:3164, facafffb0b47)
  • Current main recovery path supplies the reason the PR is trying to preserve: Current main's stuck-session recovery calls abortAndDrainEmbeddedPiRun with reason: "stuck_recovery", and the existing abort/drain implementation accepts a reason field but currently aborts through abortEmbeddedPiRun without passing it to the active handle. (src/logging/diagnostic-stuck-session-recovery.runtime.ts:121, a04566da11ce)
  • Actual PR head does not include the claimed wrapper/test follow-up: The PR body/comment claims a fix commit and a runs.test.ts regression test, but git show --stat for the live head only shows the six source files and still has no runs.test.ts change or abort wrapper change. (facafffb0b47)
  • History provenance for related code: Current-main blame points the abort/drain, queue-handle, and stuck-recovery lines at recent embedded-runner history, and git log -S hasPayloadText traces the overflow fallback to the prior silent-overflow fix work. (src/auto-reply/reply/agent-runner-execution.ts:2496, eb73e87f18d1)

Likely related people:

  • steipete: git log -S hasPayloadText traces the context-overflow fallback to prior silent-overflow fix work, with recent nearby no-reply timeout changes also in this area. (role: overflow fallback introducer and recent area contributor; confidence: high; commits: eb73e87f18d1, e510042870cf; files: src/auto-reply/reply/agent-runner-execution.ts, src/agents/pi-embedded-runner/runs.ts)
  • Gio Della-Libera: Current-main blame in this checkout points the embedded abort/drain, queue-handle, and stuck-recovery lines at a recent source-state commit, making them a useful routing candidate for this path. (role: recent area contributor; confidence: medium; commits: f7c05dcc9e51; files: src/agents/pi-embedded-runner/runs.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/logging/diagnostic-stuck-session-recovery.runtime.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against a04566da11ce.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. mantis: telegram-visible-proof Mantis should capture Telegram visible proof. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 20, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@lml2468

lml2468 commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Fixed the P1 abort reason contract bug: queueHandle.abort was wired directly to abortRun(isTimeout, reason?), causing handle.abort("stuck_recovery") to set isTimeout=true with reason=undefined. Changed to (reason?) => abortRun(false, reason) so the reason flows through to AbortSignal.reason correctly.

Also added a regression test in runs.test.ts covering reason propagation through the abort chain.

@clawsweeper

clawsweeper Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling mantis: telegram-visible-proof Mantis should capture Telegram visible proof. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: S status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Preemptive context overflow silently kills embedded sessions without notifying user

2 participants