fix(agents): rethrow EmbeddedAttemptSessionTakeoverError before model fallback#83436
Conversation
|
Codex review: needs real behavior proof before merge. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. source-level reproduction is high-confidence: current main has no embedded takeover carve-out before fallback classification, and a multi-candidate fallback run whose first callback throws the takeover error will record a candidate failure instead of rethrowing. I did not reproduce the reported live Discord cadence in this read-only review. PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review findings
Review detailsBest possible solution: Use a recursive session-ownership detector at the fallback boundary, keep the embedded takeover fail-fast path, add wrapped-lock regression coverage, and land only with redacted after-fix runtime proof. Do we have a high-confidence way to reproduce the issue? Yes, source-level reproduction is high-confidence: current main has no embedded takeover carve-out before fallback classification, and a multi-candidate fallback run whose first callback throws the takeover error will record a candidate failure instead of rethrowing. I did not reproduce the reported live Discord cadence in this read-only review. Is this the best way to solve the issue? No, not yet. The embedded takeover direction is narrow and appropriate, but the write-lock portion should preserve the recursive wrapped-error handling from the prior related fix before this is the best mergeable solution. Full review comments:
Overall correctness: patch is incorrect Acceptance criteria:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e3d55188384b. |
…unner-owned shape ClawSweeper review on openclaw#84046 flagged the prior count-only check as too broad: concurrent transcript rewrites, file replacements, branch changes, and same-user-count mutations could all keep the user-message count unchanged while changing session state. Tighten the fence disambiguation so a stat-fingerprint mismatch is only tolerated when ALL of these hold: - dev/ino match (no atomic replacement onto a new inode) - birthtimeNs matches (no unlink+recreate-same-inode case; common on tmpfs/ext4 with rapid recreation, where dev and ino are reused) - size grew (file is append-only between fence-set and assert) - bytes [0, fenceSize) hash identical to the fenced prefix hash (no in-place rewrite of earlier transcript content) - bytes [fenceSize, currentSize) contain only message entries with role assistant or tool (no new user-role turn, no malformed tail) Anything else — replacement, shrink, in-place rewrite, new user turn, or compaction-style mutation — remains a real takeover. Snapshot now captures the prefix hash alongside the fingerprint at releaseForPrompt and refreshSessionFileFence so the slow path can verify rigorously. Adds tests covering in-place rewrite and unlink+recreate replacement (both with prefix bytes coincidentally matching and roles that would otherwise look runner-owned); existing append-only and new-user-turn tests preserved. Refs: openclaw#83436, openclaw#83615 Addresses ClawSweeper P1 finding on prior revision of openclaw#84046.
…t non-message tail entries ClawSweeper re-review on openclaw#84046 flagged two P1 issues on the prior revision of the append-only fence: - The role allowlist used 'tool', but OpenClaw persists tool outputs as message.role === 'toolResult' (see src/agents/pi-embedded-runner/transcript-file-state.ts:182). Real tool-using turns would still false-trip the takeover. Update the allowlist to match isAgentMessage's runner-owned roles: assistant, toolResult, bashExecution. - The slow path silently 'continue'd through non-message tail entries (custom, branch_summary, compaction, session). That let a different owner mutate session state during the released-lock window without a new user turn. Fail closed: any non-message entry in the tail rejects the fence. Test updates: existing acceptance test now appends assistant + toolResult + bashExecution (the three valid persisted runner-owned roles); new test asserts a 'type: custom' tail entry trips takeover even though it grew append-only. Refs: openclaw#83436, openclaw#83615 Addresses ClawSweeper P1 findings on second revision of openclaw#84046.
…l entries
ClawSweeper P2 finding on the prior revision: the slow path accepted
any 'type: message' entry with role in {assistant, toolResult,
bashExecution} without checking the full persisted message contract.
Malformed or externally-synthesized entries carrying an accepted role
would advance the fence over state they do not actually represent.
Add a local isPersistedRunnerOwnedMessage validator that mirrors
isAgentMessage in transcript-file-state.ts for the three accepted
roles: assistant needs content present; toolResult needs string
toolCallId/toolName, boolean isError, array content; bashExecution
needs string command/output, boolean cancelled/truncated, and
number-or-undefined exitCode. Reject everything else.
Tests: existing acceptance test updated to use array content for
toolResult (matching the persisted schema); new test asserts a
toolResult message missing toolName/isError/content trips takeover
even though role is accepted.
Refs: openclaw#83436, openclaw#83615
Addresses ClawSweeper P2 finding on third revision of openclaw#84046.
ClawSweeper P2 finding on prior revision: the local validator was a hand-rolled subset of isAgentMessage in transcript-file-state.ts, and each round of review surfaced a new schema gap (assistant content just had to exist; toolResult content just had to be an array; bashExecution missed nothing here yet). Reproducing the contract locally is the wrong layer — transcript-file-state.ts already owns the canonical persisted message contract. Export isAgentMessage from transcript-file-state.ts and call it from the fence slow path, then gate on the runner-owned role subset (assistant, toolResult, bashExecution). user-role and custom-role messages remain real takeover signals even when their contract is valid. The local helper drops to ten lines: full contract check via the canonical validator, then role-subset gate. No new tests needed — the existing malformed-toolResult regression already covers the schema mismatch the canonical validator now catches more strictly (missing toolName/isError + non-array content). Refs: openclaw#83436, openclaw#83615 Addresses ClawSweeper P2 finding on fourth revision of openclaw#84046.
Summary
EmbeddedAttemptSessionTakeoverError(the embedded-attempt fingerprint guard that fires when the session JSONL changes during the released-lock window) is currently routed throughrunWithModelFallbackas an ordinary candidate failure. Every model candidate fails with the same takeover error, the chain exhausts, the last fallback succeeds, and the session ends up effectively pinned to that fallback model — with no provider error and no normal fallback receipt.This PR extends the existing carve-out family (
isCommandLaneTaskTimeoutError,shouldRethrowAbort) to also rethrowEmbeddedAttemptSessionTakeoverErrorandSessionWriteLockTimeoutErrorbefore fallback classification.Repro
Observed on Discord-channel-bound sessions under normal traffic. journalctl shows:
Then the successful fallback persists as
modelOverrideSource: auto. Manual re-pins decay back to the same fallback within minutes.Four complete chain-walks observed on one seat between 20:13 and 20:30 PDT, cadence ~5 min.
Why this presents as silent reassertion
EmbeddedAttemptSessionTakeoverErroris a local session-state guard (introduced in Release embedded session write lock before model I/O #82891). It is not a provider failure.describeFailoverError()does not classify it as a provider-class failure, so[model-fallback/decision]logsreason=unknown. The clue only survives indetail=....Relation to #66646 / #78633
Issue #66646 (
Session file lock errors cascade through model fallback chain) was fixed by #78633, which exportedhasSessionWriteLockTimeoutand the carve-out narrative. However, the runtime intercept forrunWithModelFallbackwas not retained onmain:hasSessionWriteLockTimeoutis only referenced fromfailover-error.tsdescribe/classify paths, not from the fallback runner itself.EmbeddedAttemptSessionTakeoverError(introduced by #82891) extends the same family but does not yet have an interceptor anywhere in the model-fallback chain. This PR adds both interceptors inrunFallbackCandidate.Change
The pattern matches the existing
isCommandLaneTaskTimeoutErrorcarve-out exactly.Tests
Added regression in
src/agents/model-fallback.test.ts:onError(the candidate-failed receipt path)Local verification on this branch:
pnpm vitest run src/agents/model-fallback.test.ts -t "session takeover|command-lane watchdog|unrecognized errors"→ 6 passed / 108 skipped (114 total)pnpm vitest run src/agents/model-fallback.test.ts→ 2 files / 114 tests passedSearch key for future operators
session file changed while embedded prompt lock was releasedThis is the signature in journalctl. With this fix it stays a session-takeover error and is handled by the embedded runner / retry path instead of poisoning model fallback.
Out of scope (follow-up lanes)
[model-fallback/decision] reason=unknownshould preserve the underlying error class in the decision-line reason field. The detail string carries it, but the reason field collapses tounknown. Surfacing-(2) prevents the next masked-as-unknown bug from being invisible.cc @sallyom (author of #78633, related fix)