fix(embedded-runner): preserve provider errors on cleanup takeover#84056
fix(embedded-runner): preserve provider errors on cleanup takeover#84056abnershang wants to merge 2 commits into
Conversation
|
Codex review: needs maintainer review before merge. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. at source level. Current main rejects cleanup takeover ahead of a prior prompt error, and the fallback classifier can normalize a provider-looking takeover wrapper unless the direct takeover identity is preserved. PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. PR egg Rarity: 🥚 common. What is this egg doing here?
Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the narrow error-precedence and fallback-abort fix after required checks if maintainers accept boundary-level proof; ask for live cleanup-race logs only if that proof bar is needed for this session-state path. Do we have a high-confidence way to reproduce the issue? Yes, at source level. Current main rejects cleanup takeover ahead of a prior prompt error, and the fallback classifier can normalize a provider-looking takeover wrapper unless the direct takeover identity is preserved. Is this the best way to solve the issue? Yes. The proposed fix keeps the provider message user-facing while carrying a takeover identity/cause through fallback, with focused tests for prompt-error preservation, cleanup-only takeover, and fallback abort behavior. Label justifications:
Acceptance criteria:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against f07c87405c30. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82fb23685a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (shouldPreservePromptError) { | ||
| log.warn( | ||
| `embedded attempt cleanup detected session takeover after prompt failure; preserving prompt error: ` + | ||
| `runId=${params.runId} sessionId=${params.sessionId} ` + | ||
| `promptError=${formatErrorMessage(promptError)} cleanupError=${formatErrorMessage(cleanupError)}`, | ||
| ); | ||
| } else { | ||
| await Promise.reject(cleanupError); |
There was a problem hiding this comment.
Preserve takeover signal for fallback
When a provider prompt failure is classified for fallback (for example rate limits or no endpoints) and cleanup also detects EmbeddedAttemptSessionTakeoverError, this branch only logs and lets the attempt surface the provider error. The outer runWithModelFallback path explicitly aborts fallback on session takeover (src/agents/model-fallback.ts:1200-1207); after this change it only sees the provider error and can dispatch another model attempt into a session that has been taken over. Keep the user-facing provider error, but carry/throw the takeover signal so fallback cannot continue in that state.
Useful? React with 👍 / 👎.
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper re-review |
|
@clawsweeper re-review |
|
@clawsweeper automerge |
|
🦞🔧
Draft PRs stay fix-only until GitHub marks them ready for review. Pause with Automerge progress:
|
|
ClawSweeper 🐠 reef update Thanks for the work here. ClawSweeper could not write to the source branch, so it opened a replacement PR rather than letting the fix drift. attribution still points back here. Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
fish notes: model gpt-5.5, reasoning high; reviewed against e7d9d8c. |
Summary
EmbeddedAttemptSessionTakeoverError.Real behavior proof
Behavior addressed: embedded runner cleanup can race with a session takeover after a provider prompt failure; the user-facing failure should remain the provider/prompt error, but fallback must still see the takeover signal and stop instead of dispatching another model attempt into a taken-over session.
Real environment tested: local OpenClaw source checkout in an isolated worktree on this branch, using the patched fallback classifier from the worktree root.
Exact steps or command run after the patch:
node --import tsx -e '<construct a provider-message wrapper with EmbeddedAttemptSessionTakeoverError identity/cause and pass it to isNonProviderRuntimeCoordinationError>' node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.spawn-workspace.context-engine.test.ts src/agents/model-fallback.test.ts pnpm tsgo:test:src pnpm check:changed git diff --check origin/main...HEADEvidence after fix: terminal output from the one-shot runtime fallback-classifier probe:
{ "wrapperName": "EmbeddedAttemptSessionTakeoverError", "userFacingMessage": "provider rejected request: HTTP 400", "causeName": "EmbeddedAttemptSessionTakeoverError", "abortsFallback": true }Observed result after fix: the wrapper preserves the provider-facing message while retaining the takeover identity/cause, and the fallback classifier returns
abortsFallback: truefor that runtime-shaped error.What was not tested: no live provider outage was forced against a production gateway. The runtime classifier probe exercises the non-mocked fallback boundary; the cleanup race itself is covered by the embedded-runner attempt harness that owns the cleanup/error precedence path.
Supplemental checks:
pnpm tsgo:test:srcpassed.pnpm check:changedpassed the core/coreTests lane, lint, typechecks, dependency guards, import-cycle guard, and related changed-file checks.git diff --check origin/main...HEADpassed.Review
Codex review initially flagged that preserving only the provider error could let fallback continue after session takeover. The second commit addresses that by carrying the takeover signal through fallback classification while keeping the provider message user-facing. ClawSweeper re-review now reports patch quality as acceptable; this body update uses the proof-field labels required by the repository gate.
AI-assisted: yes. I understand the change and verified the narrow error-precedence and fallback-abort behavior locally.