Skip to content

fix(agents): rethrow EmbeddedAttemptSessionTakeoverError before model fallback#83436

Open
cael-dandelion-cult wants to merge 1 commit into
openclaw:mainfrom
cael-dandelion-cult:cael/upstream-session-takeover-no-fallback
Open

fix(agents): rethrow EmbeddedAttemptSessionTakeoverError before model fallback#83436
cael-dandelion-cult wants to merge 1 commit into
openclaw:mainfrom
cael-dandelion-cult:cael/upstream-session-takeover-no-fallback

Conversation

@cael-dandelion-cult

Copy link
Copy Markdown

Summary

EmbeddedAttemptSessionTakeoverError (the embedded-attempt fingerprint guard that fires when the session JSONL changes during the released-lock window) is currently routed through runWithModelFallback as an ordinary candidate failure. Every model candidate fails with the same takeover error, the chain exhausts, the last fallback succeeds, and the session ends up effectively pinned to that fallback model — with no provider error and no normal fallback receipt.

This PR extends the existing carve-out family (isCommandLaneTaskTimeoutError, shouldRethrowAbort) to also rethrow EmbeddedAttemptSessionTakeoverError and SessionWriteLockTimeoutError before fallback classification.

Repro

Observed on Discord-channel-bound sessions under normal traffic. journalctl shows:

[diagnostic] lane task error: lane=main durationMs=... error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: .../sessions/<id>.jsonl"
[model-fallback/decision] decision=candidate_failed
  requested=github-copilot/claude-opus-4.7-1m-internal
  candidate=github-copilot/claude-opus-4.7-1m-internal
  reason=unknown
  next=github-copilot/claude-opus-4.6
  detail=session file changed while embedded prompt lock was released
... chain repeats for each fallback ...
[model-fallback/decision] decision=candidate_succeeded candidate=openai-codex/gpt-5.5

Then the successful fallback persists as modelOverrideSource: auto. Manual re-pins decay back to the same fallback within minutes.

Four complete chain-walks observed on one seat between 20:13 and 20:30 PDT, cadence ~5 min.

Why this presents as silent reassertion

  • EmbeddedAttemptSessionTakeoverError is a local session-state guard (introduced in Release embedded session write lock before model I/O #82891). It is not a provider failure.
  • describeFailoverError() does not classify it as a provider-class failure, so [model-fallback/decision] logs reason=unknown. The clue only survives in detail=....
  • Manual model re-pins are not stale state — they decay because the next attempt re-hits the takeover race.
  • Seats with quieter session-file write cadence (no concurrent JSONL writes during the lock-release window) do not trigger the race and hold canon.

Relation to #66646 / #78633

Issue #66646 (Session file lock errors cascade through model fallback chain) was fixed by #78633, which exported hasSessionWriteLockTimeout and the carve-out narrative. However, the runtime intercept for runWithModelFallback was not retained on main: hasSessionWriteLockTimeout is only referenced from failover-error.ts describe/classify paths, not from the fallback runner itself. EmbeddedAttemptSessionTakeoverError (introduced by #82891) extends the same family but does not yet have an interceptor anywhere in the model-fallback chain. This PR adds both interceptors in runFallbackCandidate.

Change

+function isEmbeddedAttemptSessionTakeoverError(err: unknown): boolean { ... }
+function shouldRethrowSessionOwnershipError(err: unknown): boolean {
+  return isEmbeddedAttemptSessionTakeoverError(err) || isSessionWriteLockTimeoutError(err);
+}

 // in runFallbackCandidate
 if (isCommandLaneTaskTimeoutError(err)) {
   throw err;
 }
+if (shouldRethrowSessionOwnershipError(err)) {
+  throw err;
+}

The pattern matches the existing isCommandLaneTaskTimeoutError carve-out exactly.

Tests

Added regression in src/agents/model-fallback.test.ts:

  • session takeover error rethrown before second fallback candidate is invoked
  • session takeover error does not call onError (the candidate-failed receipt path)

Local verification on this branch:

  • pnpm vitest run src/agents/model-fallback.test.ts -t "session takeover|command-lane watchdog|unrecognized errors" → 6 passed / 108 skipped (114 total)
  • pnpm vitest run src/agents/model-fallback.test.ts → 2 files / 114 tests passed

Search key for future operators

session file changed while embedded prompt lock was released

This is the signature in journalctl. With this fix it stays a session-takeover error and is handled by the embedded runner / retry path instead of poisoning model fallback.

Out of scope (follow-up lanes)

  • [model-fallback/decision] reason=unknown should preserve the underlying error class in the decision-line reason field. The detail string carries it, but the reason field collapses to unknown. Surfacing-(2) prevents the next masked-as-unknown bug from being invisible.
  • Identifying which writer concurrently mutates the session JSONL during the embedded-prompt-lock-release window. The race itself is reduced (Release embedded session write lock before model I/O #82891 releases the lock before model I/O) but takeover still fires under traffic.

cc @sallyom (author of #78633, related fix)

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 18, 2026
@clawsweeper

clawsweeper Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The branch adds a model-fallback carve-out for embedded session takeover and session write-lock timeout errors, plus a regression test for the embedded takeover path.

Reproducibility: yes. source-level reproduction is high-confidence: current main has no embedded takeover carve-out before fallback classification, and a multi-candidate fallback run whose first callback throws the takeover error will record a candidate failure instead of rethrowing. I did not reproduce the reported live Discord cadence in this read-only review.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Summary: The branch targets the right runtime path, but a wrapped-lock correctness gap and missing after-fix real behavior proof keep it from being merge-ready.

Rank-up moves:

  • Replace the direct write-lock check with recursive session-ownership detection and add a wrapped-lock fallback regression test.
  • Add redacted after-fix proof from a Discord-bound or embedded-session run showing no repeated model-fallback candidate failures and no persisted auto fallback model.
  • Update the PR body with the proof so a fresh ClawSweeper review can run automatically, or ask a maintainer to comment @clawsweeper re-review if it does not.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body has pre-fix journalctl observations and unit-test commands, but no after-fix real Discord or embedded-session proof; a redacted terminal log, copied live output, screenshot/recording, or linked artifact should be added, with private details such as IPs, keys, phone numbers, and non-public endpoints removed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge
Why this matters: - Merging as-is leaves wrapped SessionWriteLockTimeoutError cases able to walk configured fallback candidates, which is the same failure shape the prior related fix explicitly covered.

  • The behavior changes local session-ownership failures from candidate-failed fallback receipts into immediate throw paths, so the affected embedded or Discord-bound retry/receipt path still needs runtime proof.
  • The contributor proof is currently pre-fix logs plus unit-test output; there is no after-fix real run showing the reported auto fallback model pin no longer persists.

Maintainer options:

  1. Fix detector and prove runtime behavior (recommended)
    Reuse or export the recursive session-lock detector, add a wrapped-lock fallback regression test, and require redacted after-fix Discord or embedded-session proof before merge.
  2. Accept direct-only coverage temporarily
    Maintainers can merge as-is only if they intentionally accept that wrapped session-lock timeouts may still walk the fallback chain until a follow-up lands.
  3. Pause if runtime proof is unavailable
    If the reported Discord or embedded-session path cannot be shown after the patch, hold or close this PR rather than landing a unit-test-only session-state change.

Next step before merge
This needs contributor real behavior proof and a source-branch code/test revision; ClawSweeper should not open a repair lane while the proof gate remains contributor-owned.

Security
Cleared: The diff only changes local error classification and a regression test, with no dependency, CI, script, secret, permission, or supply-chain surface added.

Review findings

  • [P2] Preserve wrapped session-lock detection — src/agents/model-fallback.ts:134
Review details

Best possible solution:

Use a recursive session-ownership detector at the fallback boundary, keep the embedded takeover fail-fast path, add wrapped-lock regression coverage, and land only with redacted after-fix runtime proof.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level reproduction is high-confidence: current main has no embedded takeover carve-out before fallback classification, and a multi-candidate fallback run whose first callback throws the takeover error will record a candidate failure instead of rethrowing. I did not reproduce the reported live Discord cadence in this read-only review.

Is this the best way to solve the issue?

No, not yet. The embedded takeover direction is narrow and appropriate, but the write-lock portion should preserve the recursive wrapped-error handling from the prior related fix before this is the best mergeable solution.

Full review comments:

  • [P2] Preserve wrapped session-lock detection — src/agents/model-fallback.ts:134
    This helper only recognizes direct SessionWriteLockTimeoutError values. The prior merged fix used a recursive detector, and current failover tests still cover lock errors nested under cause or reason; those wrappers can fall through as ordinary unknown candidate failures and still walk the fallback chain. Please reuse or export the recursive detector here and add a wrapped-lock fallback regression test.
    Confidence: 0.87

Overall correctness: patch is incorrect
Overall confidence: 0.84

Acceptance criteria:

  • node scripts/run-vitest.mjs src/agents/model-fallback.test.ts src/agents/failover-error.test.ts
  • Redacted Discord-bound or embedded-session run showing EmbeddedAttemptSessionTakeoverError and wrapped SessionWriteLockTimeoutError do not walk model fallbacks or persist an auto fallback model

What I checked:

  • Current fallback catch path lacks the ownership carve-out: On current main, runFallbackCandidate only rethrows command-lane task timeouts before coerceToFailoverError, so embedded takeover or session-lock errors can still be returned as candidate failures and recorded by the fallback loop. (src/agents/model-fallback.ts:247, e3d55188384b)
  • Embedded takeover is a local session ownership guard: EmbeddedAttemptSessionTakeoverError is defined with the exact session-file-changed message from the PR report, and it is raised by the embedded session-lock fence when the JSONL fingerprint changes after prompt-lock release. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:167, e3d55188384b)
  • Current classifier preserves recursive lock detection: hasSessionWriteLockTimeout walks error, cause, and reason recursively, and adjacent tests cover SessionWriteLockTimeoutError nested under cause or reason. (src/agents/failover-error.ts:218, e3d55188384b)
  • Prior merged fix used the recursive detector at the fallback boundary: Merged PR fix(agents): fail fast on session lock fallback #78633 added hasSessionWriteLockTimeout(normalized) in model-fallback.ts so wrapped session write-lock errors failed fast instead of walking configured fallbacks. (src/agents/model-fallback.ts:1049, a74894a95401)
  • PR implementation is direct-only for write-lock timeouts: The PR's shouldRethrowSessionOwnershipError combines embedded takeover detection with isSessionWriteLockTimeoutError(err), but does not reuse the recursive hasSessionWriteLockTimeout path, and the added test covers embedded takeover only. (src/agents/model-fallback.ts:134, f95dc0a85649)
  • Real behavior proof is still missing: The PR body includes pre-fix journalctl observations and unit-test commands, but no after-fix Discord-bound or embedded-session run showing the fallback chain no longer walks and no auto model pin persists. (f95dc0a85649)

Likely related people:

  • sallyom: Authored merged PR fix(agents): fail fast on session lock fallback #78633, which added the previous recursive session-lock fallback guard and tests for failing fast on session write-lock timeout. (role: related prior fix author; confidence: high; commits: a74894a95401; files: src/agents/model-fallback.ts, src/agents/model-fallback.test.ts, src/agents/failover-error.ts)
  • amknight: Authored merged PR Release embedded session write lock before model I/O #82891, which introduced EmbeddedAttemptSessionTakeoverError and the prompt-lock-release session fence this PR handles. (role: embedded session-lock feature owner; confidence: high; commits: 8a060b2904d4; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts)
  • vincentkoc: Recent current-main agent fallback work touched model-fallback.ts and model-fallback.test.ts around harness/fallback behavior, adjacent to this PR's runtime path. (role: recent model-fallback area contributor; confidence: medium; commits: 8f27b3e21f6b, e3d55188384b; files: src/agents/model-fallback.ts, src/agents/model-fallback.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against e3d55188384b.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. P1 High-priority user-facing bug, regression, or broken workflow. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. impact:auth-provider Auth, provider routing, model choice, or SecretRef resolution may break. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. labels May 18, 2026
kays0x added a commit to kays0x/openclaw that referenced this pull request May 19, 2026
…unner-owned shape

ClawSweeper review on openclaw#84046 flagged the prior count-only check as too
broad: concurrent transcript rewrites, file replacements, branch
changes, and same-user-count mutations could all keep the user-message
count unchanged while changing session state. Tighten the fence
disambiguation so a stat-fingerprint mismatch is only tolerated when
ALL of these hold:

- dev/ino match (no atomic replacement onto a new inode)
- birthtimeNs matches (no unlink+recreate-same-inode case; common on
  tmpfs/ext4 with rapid recreation, where dev and ino are reused)
- size grew (file is append-only between fence-set and assert)
- bytes [0, fenceSize) hash identical to the fenced prefix hash (no
  in-place rewrite of earlier transcript content)
- bytes [fenceSize, currentSize) contain only message entries with role
  assistant or tool (no new user-role turn, no malformed tail)

Anything else — replacement, shrink, in-place rewrite, new user turn,
or compaction-style mutation — remains a real takeover. Snapshot now
captures the prefix hash alongside the fingerprint at releaseForPrompt
and refreshSessionFileFence so the slow path can verify rigorously.

Adds tests covering in-place rewrite and unlink+recreate replacement
(both with prefix bytes coincidentally matching and roles that would
otherwise look runner-owned); existing append-only and new-user-turn
tests preserved.

Refs: openclaw#83436, openclaw#83615
Addresses ClawSweeper P1 finding on prior revision of openclaw#84046.
kays0x added a commit to kays0x/openclaw that referenced this pull request May 19, 2026
…t non-message tail entries

ClawSweeper re-review on openclaw#84046 flagged two P1 issues on the prior
revision of the append-only fence:

- The role allowlist used 'tool', but OpenClaw persists tool outputs
  as message.role === 'toolResult' (see
  src/agents/pi-embedded-runner/transcript-file-state.ts:182). Real
  tool-using turns would still false-trip the takeover. Update the
  allowlist to match isAgentMessage's runner-owned roles: assistant,
  toolResult, bashExecution.

- The slow path silently 'continue'd through non-message tail entries
  (custom, branch_summary, compaction, session). That let a different
  owner mutate session state during the released-lock window without
  a new user turn. Fail closed: any non-message entry in the tail
  rejects the fence.

Test updates: existing acceptance test now appends assistant +
toolResult + bashExecution (the three valid persisted runner-owned
roles); new test asserts a 'type: custom' tail entry trips takeover
even though it grew append-only.

Refs: openclaw#83436, openclaw#83615
Addresses ClawSweeper P1 findings on second revision of openclaw#84046.
kays0x added a commit to kays0x/openclaw that referenced this pull request May 19, 2026
…l entries

ClawSweeper P2 finding on the prior revision: the slow path accepted
any 'type: message' entry with role in {assistant, toolResult,
bashExecution} without checking the full persisted message contract.
Malformed or externally-synthesized entries carrying an accepted role
would advance the fence over state they do not actually represent.

Add a local isPersistedRunnerOwnedMessage validator that mirrors
isAgentMessage in transcript-file-state.ts for the three accepted
roles: assistant needs content present; toolResult needs string
toolCallId/toolName, boolean isError, array content; bashExecution
needs string command/output, boolean cancelled/truncated, and
number-or-undefined exitCode. Reject everything else.

Tests: existing acceptance test updated to use array content for
toolResult (matching the persisted schema); new test asserts a
toolResult message missing toolName/isError/content trips takeover
even though role is accepted.

Refs: openclaw#83436, openclaw#83615
Addresses ClawSweeper P2 finding on third revision of openclaw#84046.
kays0x added a commit to kays0x/openclaw that referenced this pull request May 19, 2026
ClawSweeper P2 finding on prior revision: the local validator was a
hand-rolled subset of isAgentMessage in transcript-file-state.ts, and
each round of review surfaced a new schema gap (assistant content
just had to exist; toolResult content just had to be an array;
bashExecution missed nothing here yet). Reproducing the contract
locally is the wrong layer — transcript-file-state.ts already owns
the canonical persisted message contract.

Export isAgentMessage from transcript-file-state.ts and call it from
the fence slow path, then gate on the runner-owned role subset
(assistant, toolResult, bashExecution). user-role and custom-role
messages remain real takeover signals even when their contract is
valid.

The local helper drops to ten lines: full contract check via the
canonical validator, then role-subset gate. No new tests needed —
the existing malformed-toolResult regression already covers the
schema mismatch the canonical validator now catches more strictly
(missing toolName/isError + non-array content).

Refs: openclaw#83436, openclaw#83615
Addresses ClawSweeper P2 finding on fourth revision of openclaw#84046.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling impact:auth-provider Auth, provider routing, model choice, or SecretRef resolution may break. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant