fix(agents): rethrow EmbeddedAttemptSessionTakeoverError before model fallback by cael-dandelion-cult · Pull Request #83436 · openclaw/openclaw

cael-dandelion-cult · 2026-05-18T04:37:39Z

Summary

EmbeddedAttemptSessionTakeoverError (the embedded-attempt fingerprint guard that fires when the session JSONL changes during the released-lock window) is currently routed through runWithModelFallback as an ordinary candidate failure. Every model candidate fails with the same takeover error, the chain exhausts, the last fallback succeeds, and the session ends up effectively pinned to that fallback model — with no provider error and no normal fallback receipt.

This PR extends the existing carve-out family (isCommandLaneTaskTimeoutError, shouldRethrowAbort) to also rethrow EmbeddedAttemptSessionTakeoverError and SessionWriteLockTimeoutError before fallback classification.

Repro

Observed on Discord-channel-bound sessions under normal traffic. journalctl shows:

[diagnostic] lane task error: lane=main durationMs=... error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: .../sessions/<id>.jsonl"
[model-fallback/decision] decision=candidate_failed
  requested=github-copilot/claude-opus-4.7-1m-internal
  candidate=github-copilot/claude-opus-4.7-1m-internal
  reason=unknown
  next=github-copilot/claude-opus-4.6
  detail=session file changed while embedded prompt lock was released
... chain repeats for each fallback ...
[model-fallback/decision] decision=candidate_succeeded candidate=openai-codex/gpt-5.5

Then the successful fallback persists as modelOverrideSource: auto. Manual re-pins decay back to the same fallback within minutes.

Four complete chain-walks observed on one seat between 20:13 and 20:30 PDT, cadence ~5 min.

Why this presents as silent reassertion

EmbeddedAttemptSessionTakeoverError is a local session-state guard (introduced in Release embedded session write lock before model I/O #82891). It is not a provider failure.
describeFailoverError() does not classify it as a provider-class failure, so [model-fallback/decision] logs reason=unknown. The clue only survives in detail=....
Manual model re-pins are not stale state — they decay because the next attempt re-hits the takeover race.
Seats with quieter session-file write cadence (no concurrent JSONL writes during the lock-release window) do not trigger the race and hold canon.

Relation to #66646 / #78633

Issue #66646 (Session file lock errors cascade through model fallback chain) was fixed by #78633, which exported hasSessionWriteLockTimeout and the carve-out narrative. However, the runtime intercept for runWithModelFallback was not retained on main: hasSessionWriteLockTimeout is only referenced from failover-error.ts describe/classify paths, not from the fallback runner itself. EmbeddedAttemptSessionTakeoverError (introduced by #82891) extends the same family but does not yet have an interceptor anywhere in the model-fallback chain. This PR adds both interceptors in runFallbackCandidate.

Change

+function isEmbeddedAttemptSessionTakeoverError(err: unknown): boolean { ... }
+function shouldRethrowSessionOwnershipError(err: unknown): boolean {
+  return isEmbeddedAttemptSessionTakeoverError(err) || isSessionWriteLockTimeoutError(err);
+}

 // in runFallbackCandidate
 if (isCommandLaneTaskTimeoutError(err)) {
   throw err;
 }
+if (shouldRethrowSessionOwnershipError(err)) {
+  throw err;
+}

The pattern matches the existing isCommandLaneTaskTimeoutError carve-out exactly.

Tests

Added regression in src/agents/model-fallback.test.ts:

session takeover error rethrown before second fallback candidate is invoked
session takeover error does not call onError (the candidate-failed receipt path)

Local verification on this branch:

pnpm vitest run src/agents/model-fallback.test.ts -t "session takeover|command-lane watchdog|unrecognized errors" → 6 passed / 108 skipped (114 total)
pnpm vitest run src/agents/model-fallback.test.ts → 2 files / 114 tests passed

Search key for future operators

session file changed while embedded prompt lock was released

This is the signature in journalctl. With this fix it stays a session-takeover error and is handled by the embedded runner / retry path instead of poisoning model fallback.

Out of scope (follow-up lanes)

[model-fallback/decision] reason=unknown should preserve the underlying error class in the decision-line reason field. The detail string carries it, but the reason field collapses to unknown. Surfacing-(2) prevents the next masked-as-unknown bug from being invisible.
Identifying which writer concurrently mutates the session JSONL during the embedded-prompt-lock-release window. The race itself is reduced (Release embedded session write lock before model I/O #82891 releases the lock before model I/O) but takeover still fires under traffic.

cc @sallyom (author of #78633, related fix)

clawsweeper · 2026-05-18T04:38:54Z

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The branch adds a model-fallback carve-out for embedded session takeover and session write-lock timeout errors, plus a regression test for the embedded takeover path.

Reproducibility: yes. source-level reproduction is high-confidence: current main has no embedded takeover carve-out before fallback classification, and a multi-candidate fallback run whose first callback throws the takeover error will record a candidate failure instead of rethrowing. I did not reproduce the reported live Discord cadence in this read-only review.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Summary: The branch targets the right runtime path, but a wrapped-lock correctness gap and missing after-fix real behavior proof keep it from being merge-ready.

Rank-up moves:

Replace the direct write-lock check with recursive session-ownership detection and add a wrapped-lock fallback regression test.
Add redacted after-fix proof from a Discord-bound or embedded-session run showing no repeated model-fallback candidate failures and no persisted auto fallback model.
Update the PR body with the proof so a fresh ClawSweeper review can run automatically, or ask a maintainer to comment @clawsweeper re-review if it does not.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body has pre-fix journalctl observations and unit-test commands, but no after-fix real Discord or embedded-session proof; a redacted terminal log, copied live output, screenshot/recording, or linked artifact should be added, with private details such as IPs, keys, phone numbers, and non-public endpoints removed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge
Why this matters: - Merging as-is leaves wrapped SessionWriteLockTimeoutError cases able to walk configured fallback candidates, which is the same failure shape the prior related fix explicitly covered.

The behavior changes local session-ownership failures from candidate-failed fallback receipts into immediate throw paths, so the affected embedded or Discord-bound retry/receipt path still needs runtime proof.
The contributor proof is currently pre-fix logs plus unit-test output; there is no after-fix real run showing the reported auto fallback model pin no longer persists.

Maintainer options:

Fix detector and prove runtime behavior (recommended)
Reuse or export the recursive session-lock detector, add a wrapped-lock fallback regression test, and require redacted after-fix Discord or embedded-session proof before merge.
Accept direct-only coverage temporarily
Maintainers can merge as-is only if they intentionally accept that wrapped session-lock timeouts may still walk the fallback chain until a follow-up lands.
Pause if runtime proof is unavailable
If the reported Discord or embedded-session path cannot be shown after the patch, hold or close this PR rather than landing a unit-test-only session-state change.

Next step before merge
This needs contributor real behavior proof and a source-branch code/test revision; ClawSweeper should not open a repair lane while the proof gate remains contributor-owned.

Security
Cleared: The diff only changes local error classification and a regression test, with no dependency, CI, script, secret, permission, or supply-chain surface added.

Review findings

[P2] Preserve wrapped session-lock detection — src/agents/model-fallback.ts:134

Review details

Best possible solution:

Use a recursive session-ownership detector at the fallback boundary, keep the embedded takeover fail-fast path, add wrapped-lock regression coverage, and land only with redacted after-fix runtime proof.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level reproduction is high-confidence: current main has no embedded takeover carve-out before fallback classification, and a multi-candidate fallback run whose first callback throws the takeover error will record a candidate failure instead of rethrowing. I did not reproduce the reported live Discord cadence in this read-only review.

Is this the best way to solve the issue?

No, not yet. The embedded takeover direction is narrow and appropriate, but the write-lock portion should preserve the recursive wrapped-error handling from the prior related fix before this is the best mergeable solution.

Full review comments:

[P2] Preserve wrapped session-lock detection — src/agents/model-fallback.ts:134
This helper only recognizes direct SessionWriteLockTimeoutError values. The prior merged fix used a recursive detector, and current failover tests still cover lock errors nested under cause or reason; those wrappers can fall through as ordinary unknown candidate failures and still walk the fallback chain. Please reuse or export the recursive detector here and add a wrapped-lock fallback regression test.
Confidence: 0.87

Overall correctness: patch is incorrect
Overall confidence: 0.84

Acceptance criteria:

node scripts/run-vitest.mjs src/agents/model-fallback.test.ts src/agents/failover-error.test.ts
Redacted Discord-bound or embedded-session run showing EmbeddedAttemptSessionTakeoverError and wrapped SessionWriteLockTimeoutError do not walk model fallbacks or persist an auto fallback model

What I checked:

Current fallback catch path lacks the ownership carve-out: On current main, runFallbackCandidate only rethrows command-lane task timeouts before coerceToFailoverError, so embedded takeover or session-lock errors can still be returned as candidate failures and recorded by the fallback loop. (src/agents/model-fallback.ts:247, e3d55188384b)
Embedded takeover is a local session ownership guard: EmbeddedAttemptSessionTakeoverError is defined with the exact session-file-changed message from the PR report, and it is raised by the embedded session-lock fence when the JSONL fingerprint changes after prompt-lock release. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:167, e3d55188384b)
Current classifier preserves recursive lock detection: hasSessionWriteLockTimeout walks error, cause, and reason recursively, and adjacent tests cover SessionWriteLockTimeoutError nested under cause or reason. (src/agents/failover-error.ts:218, e3d55188384b)
Prior merged fix used the recursive detector at the fallback boundary: Merged PR fix(agents): fail fast on session lock fallback #78633 added hasSessionWriteLockTimeout(normalized) in model-fallback.ts so wrapped session write-lock errors failed fast instead of walking configured fallbacks. (src/agents/model-fallback.ts:1049, a74894a95401)
PR implementation is direct-only for write-lock timeouts: The PR's shouldRethrowSessionOwnershipError combines embedded takeover detection with isSessionWriteLockTimeoutError(err), but does not reuse the recursive hasSessionWriteLockTimeout path, and the added test covers embedded takeover only. (src/agents/model-fallback.ts:134, f95dc0a85649)
Real behavior proof is still missing: The PR body includes pre-fix journalctl observations and unit-test commands, but no after-fix Discord-bound or embedded-session run showing the fallback chain no longer walks and no auto model pin persists. (f95dc0a85649)

Likely related people:

sallyom: Authored merged PR fix(agents): fail fast on session lock fallback #78633, which added the previous recursive session-lock fallback guard and tests for failing fast on session write-lock timeout. (role: related prior fix author; confidence: high; commits: a74894a95401; files: src/agents/model-fallback.ts, src/agents/model-fallback.test.ts, src/agents/failover-error.ts)
amknight: Authored merged PR Release embedded session write lock before model I/O #82891, which introduced EmbeddedAttemptSessionTakeoverError and the prompt-lock-release session fence this PR handles. (role: embedded session-lock feature owner; confidence: high; commits: 8a060b2904d4; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts)
vincentkoc: Recent current-main agent fallback work touched model-fallback.ts and model-fallback.test.ts around harness/fallback behavior, adjacent to this PR's runtime path. (role: recent model-fallback area contributor; confidence: medium; commits: 8f27b3e21f6b, e3d55188384b; files: src/agents/model-fallback.ts, src/agents/model-fallback.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against e3d55188384b.

…unner-owned shape ClawSweeper review on openclaw#84046 flagged the prior count-only check as too broad: concurrent transcript rewrites, file replacements, branch changes, and same-user-count mutations could all keep the user-message count unchanged while changing session state. Tighten the fence disambiguation so a stat-fingerprint mismatch is only tolerated when ALL of these hold: - dev/ino match (no atomic replacement onto a new inode) - birthtimeNs matches (no unlink+recreate-same-inode case; common on tmpfs/ext4 with rapid recreation, where dev and ino are reused) - size grew (file is append-only between fence-set and assert) - bytes [0, fenceSize) hash identical to the fenced prefix hash (no in-place rewrite of earlier transcript content) - bytes [fenceSize, currentSize) contain only message entries with role assistant or tool (no new user-role turn, no malformed tail) Anything else — replacement, shrink, in-place rewrite, new user turn, or compaction-style mutation — remains a real takeover. Snapshot now captures the prefix hash alongside the fingerprint at releaseForPrompt and refreshSessionFileFence so the slow path can verify rigorously. Adds tests covering in-place rewrite and unlink+recreate replacement (both with prefix bytes coincidentally matching and roles that would otherwise look runner-owned); existing append-only and new-user-turn tests preserved. Refs: openclaw#83436, openclaw#83615 Addresses ClawSweeper P1 finding on prior revision of openclaw#84046.

…t non-message tail entries ClawSweeper re-review on openclaw#84046 flagged two P1 issues on the prior revision of the append-only fence: - The role allowlist used 'tool', but OpenClaw persists tool outputs as message.role === 'toolResult' (see src/agents/pi-embedded-runner/transcript-file-state.ts:182). Real tool-using turns would still false-trip the takeover. Update the allowlist to match isAgentMessage's runner-owned roles: assistant, toolResult, bashExecution. - The slow path silently 'continue'd through non-message tail entries (custom, branch_summary, compaction, session). That let a different owner mutate session state during the released-lock window without a new user turn. Fail closed: any non-message entry in the tail rejects the fence. Test updates: existing acceptance test now appends assistant + toolResult + bashExecution (the three valid persisted runner-owned roles); new test asserts a 'type: custom' tail entry trips takeover even though it grew append-only. Refs: openclaw#83436, openclaw#83615 Addresses ClawSweeper P1 findings on second revision of openclaw#84046.

…l entries ClawSweeper P2 finding on the prior revision: the slow path accepted any 'type: message' entry with role in {assistant, toolResult, bashExecution} without checking the full persisted message contract. Malformed or externally-synthesized entries carrying an accepted role would advance the fence over state they do not actually represent. Add a local isPersistedRunnerOwnedMessage validator that mirrors isAgentMessage in transcript-file-state.ts for the three accepted roles: assistant needs content present; toolResult needs string toolCallId/toolName, boolean isError, array content; bashExecution needs string command/output, boolean cancelled/truncated, and number-or-undefined exitCode. Reject everything else. Tests: existing acceptance test updated to use array content for toolResult (matching the persisted schema); new test asserts a toolResult message missing toolName/isError/content trips takeover even though role is accepted. Refs: openclaw#83436, openclaw#83615 Addresses ClawSweeper P2 finding on third revision of openclaw#84046.

ClawSweeper P2 finding on prior revision: the local validator was a hand-rolled subset of isAgentMessage in transcript-file-state.ts, and each round of review surfaced a new schema gap (assistant content just had to exist; toolResult content just had to be an array; bashExecution missed nothing here yet). Reproducing the contract locally is the wrong layer — transcript-file-state.ts already owns the canonical persisted message contract. Export isAgentMessage from transcript-file-state.ts and call it from the fence slow path, then gate on the runner-owned role subset (assistant, toolResult, bashExecution). user-role and custom-role messages remain real takeover signals even when their contract is valid. The local helper drops to ten lines: full contract check via the canonical validator, then role-subset gate. No new tests needed — the existing malformed-toolResult regression already covers the schema mismatch the canonical validator now catches more strictly (missing toolName/isError + non-array content). Refs: openclaw#83436, openclaw#83615 Addresses ClawSweeper P2 finding on fourth revision of openclaw#84046.

fix: keep session takeover out of model fallback

f95dc0a

openclaw-barnacle Bot added agents Agent runtime and tooling size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 18, 2026

silas-dandelion-cult mentioned this pull request May 18, 2026

model-fallback decision telemetry: surface override-mutation cause instead of reason=unknown karmaterminal/openclaw#706

Open

cael-dandelion-cult mentioned this pull request May 18, 2026

Cherry-pick upstream #79678 (model-failback-ttl-30s) to fleet — dissolves codex/gpt snapback disease karmaterminal/openclaw#708

Open

kays0x mentioned this pull request May 19, 2026

fix(agents): stop false-positive session-takeover on runner's own transcript appends #84046

Closed

fanyangCS mentioned this pull request May 20, 2026

EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released #84059

Closed

clawsweeper Bot mentioned this pull request May 25, 2026

[Bug]: EmbeddedAttemptSessionTakeoverError during Discord runs: session file changed while embedded prompt lock was released #86508

Open

osolmaz mentioned this pull request Jun 10, 2026

Track core session/transcript SQLite migration via accessor seam #88838

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): rethrow EmbeddedAttemptSessionTakeoverError before model fallback#83436

fix(agents): rethrow EmbeddedAttemptSessionTakeoverError before model fallback#83436
cael-dandelion-cult wants to merge 1 commit into
openclaw:mainfrom
cael-dandelion-cult:cael/upstream-session-takeover-no-fallback

cael-dandelion-cult commented May 18, 2026

Uh oh!

clawsweeper Bot commented May 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

cael-dandelion-cult commented May 18, 2026

Summary

Repro

Why this presents as silent reassertion

Relation to #66646 / #78633

Change

Tests

Search key for future operators

Out of scope (follow-up lanes)

Uh oh!

clawsweeper Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

clawsweeper Bot commented May 18, 2026 •

edited

Loading