Skip to content

fix(auto-reply): preserve post-compaction failure context in user-facing recovery message (#67750)#87691

Open
sweetcornna wants to merge 1 commit into
openclaw:mainfrom
sweetcornna:fix/67750-compaction-embedded-timeout
Open

fix(auto-reply): preserve post-compaction failure context in user-facing recovery message (#67750)#87691
sweetcornna wants to merge 1 commit into
openclaw:mainfrom
sweetcornna:fix/67750-compaction-embedded-timeout

Conversation

@sweetcornna

Copy link
Copy Markdown
Contributor

Summary

What problem does this PR solve?

When auto-compaction succeeds inside a turn and the retried run still fails (e.g. LLM idle timeout (120s), fallback exhausted with billing-skipped candidates), runAgentTurnWithFallback's catch path collapses the failure into either the generic ⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session. copy or a bare cause-specific message (e.g. BILLING_ERROR_USER_MESSAGE). Either way, the user loses the fact that compaction was attempted, that the timeout was on the retried turn, and which fallback candidates failed for which reason — exactly the chain in the bug report at src/auto-reply/reply/agent-runner-execution.ts.

Why does this matter now?

#67750 was filed against the user-facing chain in src/auto-reply/reply/agent-runner-execution.ts. The same catch block is on the hot path for every messaging surface (embedded harness runs and Codex / Pi-style CLI dispatch), so the same erasure happens across channels.

What is the intended outcome?

The user-facing message preserves the post-compaction signal already tracked in autoCompactionCount and the per-attempt reason chain already carried on FallbackSummaryError.attempts. Cause-specific copy (billing, rate-limit, timeout, oauth) is preserved verbatim so existing actionable hints still surface; the generic catch-all gains a structured Auto-compaction succeeded, but the retried turn still failed — ... prefix plus a concrete next step instead of bare /new.

What is intentionally out of scope?

  • No changes to the model-fallback engine (src/agents/model-fallback.ts) or to FallbackSummaryError's shape.
  • No changes to the embedded harness compaction lifecycle (src/agents/embedded-agent-subscribe.ts) — the attemptCompactionCount increment in the outer execution layer (src/auto-reply/reply/agent-runner-execution.ts:2535,2631) is already the source of truth.
  • Heartbeat (HEARTBEAT_EXTERNAL_RUN_FAILURE_TEXT) and control-UI surfaces keep their existing copy contracts.

What does success look like?

Three new vitest cases under runAgentTurnWithFallback cover: (1) post-compaction FallbackSummaryError with timeout + billing-skip attempts surfaces both compaction context and the per-attempt summary; (2) post-compaction plain failure surfaces the structured prefix plus the generic-mode guidance; (3) failure without a successful compaction still produces the plain GENERIC_EXTERNAL_RUN_FAILURE_TEXT.

What should reviewers focus on?

  • The wrap is gated on autoCompactionCount > 0 && !params.isHeartbeat && !shouldSurfaceToControlUi, so existing copy tests (rate-limit, billing-only, gateway-restart, Codex usage-limit, missing-API-key, CLI subprocess timeout) keep their exact strings.
  • The attempt summary cap (POST_COMPACTION_ATTEMPT_SUMMARY_MAX_CHARS = 240) is sized to keep the prefix concise on transports with text-length caps.

Linked context

Which issue does this close?

Closes #67750

Which issues, PRs, or discussions are related?

Related #67750

Was this requested by a maintainer or owner?

No — addressing the labelled P2 / clawsweeper:fix-shape-clear / clawsweeper:queueable-fix / clawsweeper:source-repro triage on #67750.

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: Post-auto-compaction LLM idle timeout collapses the cause chain into the generic /new fallback (or a bare BILLING_ERROR_USER_MESSAGE) and erases the compaction-succeeded signal that is already tracked in autoCompactionCount and on FallbackSummaryError.attempts.
  • Real environment tested: Source-level repro via vitest mocks of the inner runEmbeddedAgent + runWithModelFallback seam that drive the exact failure shape from the bug report (embedded run emits compaction phase=end completed=true, then throws an LLM idle timeout; fallback layer rejects with a FallbackSummaryError containing reason: "timeout" plus reason: "billing" skip entries). The clawsweeper:source-repro label on the issue already covers the source-of-truth chain at src/auto-reply/reply/agent-runner-execution.ts:2660-2811.
  • Exact steps or command run after this patch: node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-execution.test.ts (149 tests, all passing). pnpm tsgo:core and pnpm tsgo:core:test (clean). pnpm format:check src/auto-reply/reply/agent-runner-execution.ts src/auto-reply/reply/agent-runner-execution.test.ts (clean). node scripts/run-oxlint.mjs src/auto-reply/reply/agent-runner-execution.ts src/auto-reply/reply/agent-runner-execution.test.ts (clean).
  • Evidence after fix: Test surfaces post-compaction context when the retried turn times out (#67750) asserts the rendered text contains Auto-compaction succeeded, retains the cause-specific billing copy, and includes the per-attempt summary openai-codex/gpt-5.4 timed out plus anthropic/claude-opus-4-7 skipped (billing). Test surfaces post-compaction context for plain failures without a fallback summary asserts the generic-mode guidance ("The context was compacted but no candidate could finish the turn ...") instead of the bare /new copy.
  • Observed result after fix: Vitest output: Test Files 1 passed (1) / Tests 149 passed (149).
  • What was not tested: No live LLM timeout repro against a real provider. The issue is labelled clawsweeper:source-repro precisely because the failure chain is observable in source plus emitted log lines (see the verbatim repro evidence in the issue body), and the diff exercises the exact autoCompactionCount > 0 && FallbackSummaryError seam from that chain.
  • Proof limitations or environment constraints: Triggering the live 120s LLM idle timeout would require either an actual provider hang or a configured billing-disabled fallback chain, neither of which is reliable to set up without disturbing other users. The source-level mocks reproduce the exact error shape from the repro log.
  • Before evidence (optional but encouraged): Before this patch, src/auto-reply/reply/agent-runner-execution.ts:2795-2805 would resolve fallbackText to either BILLING_ERROR_USER_MESSAGE (when hasBillingAttemptSummary is true) or GENERIC_EXTERNAL_RUN_FAILURE_TEXT (when no specific classifier matches) — both of which drop autoCompactionCount and the FallbackSummaryError.attempts cause chain on the floor.

Tests and validation

Which commands did you run?

  • node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-execution.test.ts (149 passed)
  • node scripts/run-vitest.mjs src/auto-reply/reply/agent-runner-execution.test.ts src/auto-reply/reply/agent-runner-helpers.test.ts src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts src/auto-reply/reply/agent-runner.runreplyagent.e2e.test.ts (194 passed)
  • node scripts/run-vitest.mjs src/auto-reply/reply/followup-runner.test.ts (55 passed — sibling-surface check; followup-runner shares autoCompactionCount but uses a different non-user-facing failure path)
  • pnpm tsgo:core and pnpm tsgo:core:test (clean)
  • pnpm format:check (clean)

What regression coverage was added or updated?

Three new cases in src/auto-reply/reply/agent-runner-execution.test.ts under describe("runAgentTurnWithFallback"):

  1. surfaces post-compaction context when the retried turn times out (#67750) — full repro of the bug-report failure chain (compaction succeeds; retried turn timeout; fallback exhausted with timeout + billing-skip attempts). Asserts the rendered text preserves compaction context, the resolved cause-specific copy, and the per-attempt summary.
  2. surfaces post-compaction context for plain failures without a fallback summary — covers the generic catch-all branch when the error is a plain Error (no FallbackSummaryError), confirming the structured prefix and the specific next-step guidance replace the bare /new fallback.
  3. does not add post-compaction context when no compaction succeeded — regression guard so the wrap stays scoped to autoCompactionCount > 0 and ordinary unknown errors still resolve to GENERIC_EXTERNAL_RUN_FAILURE_TEXT.

What failed before this fix, if known?

Test #1 above would have produced result.payload.text === "billing" (the mocked BILLING_ERROR_USER_MESSAGE), with no mention of compaction or the per-attempt timeout. Test #2 would have produced result.payload.text === GENERIC_RUN_FAILURE_TEXT, with no mention of compaction succeeding.

If no test was added, why not?

N/A — tests added.

Risk checklist

Did user-visible behavior change? (Yes/No)

Yes — only on the post-auto-compaction failure path. The pre-compaction failure path (autoCompactionCount === 0) returns the existing fallbackText unchanged, and existing tests for billing, rate-limit, gateway-restart, Codex usage-limit, missing-API-key, CLI subprocess timeout, and the generic copy all still assert their exact strings.

Did config, environment, or migration behavior change? (Yes/No)

No.

Did security, auth, secrets, network, or tool execution behavior change? (Yes/No)

No.

What is the highest-risk area?

User-facing copy on the post-compaction failure path. Wrong wording on a hot error path can confuse users about whether the message they sent was processed.

How is that risk mitigated?

  • The wrap is gated on three independent predicates (autoCompactionCount > 0, !params.isHeartbeat, !shouldSurfaceToControlUi), so heartbeat probes and the control-UI logs path keep their existing copy.
  • Cause-specific resolved text (BILLING_ERROR_USER_MESSAGE, rate-limit cooldown, Codex usage-limit, missing-API-key, OAuth re-auth, CLI timeout) is preserved verbatim — the wrap only prepends/appends, never replaces.
  • All 149 existing agent-runner-execution.test.ts cases still pass (including every surfaces X copy contract).

Current review state

What is the next action?

Maintainer review.

What is still waiting on author, maintainer, CI, or external proof?

CI run on the new commit.

Which bot or reviewer comments were addressed?

None yet — first revision.

@openclaw-barnacle openclaw-barnacle Bot added size: M triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels May 28, 2026
@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed June 3, 2026, 4:28 AM ET / 08:28 UTC.

Summary
This PR wraps post-auto-compaction terminal agent-run failures with compaction context and structured fallback-attempt summaries, plus regression tests for post-compaction and no-compaction paths.

PR surface: Source +104, Tests +143. Total +247 across 2 files.

Reproducibility: yes. source-level: current main tracks completed compaction and structured fallback attempts but can still resolve a post-compaction mixed timeout/billing failure to billing or generic copy without compaction context; no live timeout was reproduced in this read-only review.

Review metrics: 1 noteworthy metric.

  • PR surface: 2 files changed, +248/-1. The change is bounded to the auto-reply failure formatter and its adjacent tests, which keeps review scope focused.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted real gateway/channel output showing the new post-compaction recovery message, or get an explicit maintainer proof override for the source-repro exception.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body provides Vitest/source-level mock proof and command output, but no after-fix proof from a real gateway or channel setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] The contributor proof is source-level mocks and command output only; no redacted real gateway or channel output demonstrates the final post-compaction recovery message after this patch.
  • [P1] The changed path is user-facing failure copy across messaging surfaces, so maintainers should decide whether the source-repro label is enough to override the normal external-PR real-behavior proof gate.

Maintainer options:

  1. Decide the mitigation before merge
    Land the narrow formatter and regression tests after real behavior proof or a maintainer proof override confirms the post-compaction recovery message in an actual gateway/channel path.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • [P1] A maintainer must either request real behavior proof from the contributor or explicitly accept the source-repro proof limitation before merge; this should not queue an automated repair marker for proof.

Security
Cleared: The diff changes TypeScript source and tests only, with no dependency, workflow, secret, install, package, or code-execution surface changes.

Review details

Best possible solution:

Land the narrow formatter and regression tests after real behavior proof or a maintainer proof override confirms the post-compaction recovery message in an actual gateway/channel path.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level: current main tracks completed compaction and structured fallback attempts but can still resolve a post-compaction mixed timeout/billing failure to billing or generic copy without compaction context; no live timeout was reproduced in this read-only review.

Is this the best way to solve the issue?

Yes, as a narrow fix: the formatter is the right owner boundary because the fallback engine already exposes structured attempts and the issue is about final user-facing recovery copy, not fallback selection or session reset policy.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 34c3827290a7.

Label changes

Label justifications:

  • P2: This is a normal-priority bugfix for misleading long-session post-compaction failure reporting with limited implementation scope.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body provides Vitest/source-level mock proof and command output, but no after-fix proof from a real gateway or channel setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +104, Tests +143. Total +247 across 2 files.

View PR surface stats
Area Files Added Removed Net
Source 1 105 1 +104
Tests 1 143 0 +143
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 2 248 1 +247

What I checked:

  • Current main tracks completed compaction state: runAgentTurnWithFallback initializes autoCompactionCount and accumulates per-attempt compaction counts, so the failure formatter has local state available but does not currently use it in user-facing terminal failures. (src/auto-reply/reply/agent-runner-execution.ts:1491, 34c3827290a7)
  • Current main drops compaction context on terminal fallback failures: The catch path classifies mixed FallbackSummaryError failures as billing or generic text, then returns userVisibleFallbackText without passing autoCompactionCount or a post-compaction flag into the message. (src/auto-reply/reply/agent-runner-execution.ts:2782, 34c3827290a7)
  • Structured fallback attempts are available: FallbackSummaryError carries attempts, and the summary throw site preserves per-attempt provider, model, error, and reason data that the PR can consume without parsing the aggregate error string. (src/agents/model-fallback.ts:133, 34c3827290a7)
  • PR wraps the existing formatter rather than changing fallback selection: The head commit adds buildPostCompactionFailureRecoveryText and gates it on autoCompactionCount > 0, non-heartbeat, and non-control-UI conditions before the existing conversation policy resolver. (src/auto-reply/reply/agent-runner-execution.ts:657, 5b8a9e294839)
  • PR adds focused regression coverage: The added tests cover mixed timeout/billing fallback after completed compaction, plain post-compaction failure, and unchanged generic output when no compaction succeeded. (src/auto-reply/reply/agent-runner-execution.test.ts:4801, 5b8a9e294839)
  • Sibling followup path is not the same user-facing terminal formatter: followup-runner uses compaction count for session persistence and optional success notices, while its inspected path does not share the same final user-facing failure message construction as runAgentTurnWithFallback. (src/auto-reply/reply/followup-runner.ts:1111, 34c3827290a7)

Likely related people:

  • joshavant: Commit 1dac68c0bbe75a78df75d500ea86c7f13aca91bc changed auto-reply agent fallback provenance and tests near the same terminal fallback decision surface. (role: recent fallback-path contributor; confidence: high; commits: 1dac68c0bbe7; files: src/auto-reply/reply/agent-runner-execution.ts, src/auto-reply/reply/agent-runner-execution.test.ts)
  • dutifulbob: Commit 13c97c5a8d04897c84cbb4a82eeac502330cdee8 touched the relevant agent runner and fallback-adjacent area in a recent merged agents change. (role: recent area contributor; confidence: medium; commits: 13c97c5a8d04; files: src/auto-reply/reply/agent-runner-execution.ts, src/auto-reply/reply/agent-runner-execution.test.ts, src/agents/model-fallback.ts)
  • MertBasar0: Commit 029ca8c268848d9ff805afb7bc2830282f75bcbb added state-aware failover and lane suspension changes in model-fallback.ts, the source of the structured fallback attempts used here. (role: adjacent fallback contributor; confidence: medium; commits: 029ca8c26884; files: src/agents/model-fallback.ts)
  • vincentkoc: Current checkout history and live commit data show recent work touching agent-runner-execution.ts, though local blame is shallow/grafted and is weaker provenance for the original behavior. (role: recent adjacent contributor; confidence: medium; commits: 175cfe4846a1; files: src/auto-reply/reply/agent-runner-execution.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal backlog priority with limited blast radius. labels May 28, 2026
@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 29, 2026
@sweetcornna sweetcornna force-pushed the fix/67750-compaction-embedded-timeout branch from d2813b8 to 5b8a9e2 Compare June 3, 2026 08:19
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P2 Normal backlog priority with limited blast radius. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. size: M status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Successful auto-compaction can still end in a 120s embedded timeout and generic /new fallback

2 participants