[Bug]: Embedded agent failover treats session-file mutation as model failure and exhausts all fallbacks #83510
Closed
Labels
P1 - High-priority user-facing bug, regression, or broken workflow.
bug - Something isn't working
clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
clawsweeper:queueable-fix - ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
impact:auth-provider - Auth, provider routing, model choice, or SecretRef resolution may break.
impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
regression - Behavior that previously worked and now fails
Bug type
Regression (worked before, now fails)
Beta release blocker
No
Summary
OpenClaw 2026.5.16-beta.6 can fail an embedded agent turn before replying when the active session JSONL changes while the embedded prompt lock is released. The failure is classified as a model/candidate failure, so the same local session-file mutation is retried across unrelated fallback models. This exhausts the fallback chain and surfaces as “All models failed”, even though the first two failures are not provider/model failures.
Steps to reproduce
Expected behavior
A local session takeover/session-file mutation should be treated as a runtime coordination failure, not as a model failure.
The system should either:
Fallback should be reserved for provider/model failures such as timeout, rate limit, auth, or provider runtime errors.
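A minimal TypeScript sketch of the classification described above, assuming the takeover error can be recognized by the error name or message that appears in the gateway log. `FailureKind`, `classifyFailure`, `decideFallback`, and the `abort_without_fallback` outcome are hypothetical names used for illustration, not OpenClaw APIs; `next_fallback` and `chain_exhausted` mirror the outcomes seen in the logs. Stopping the chain here is only a placeholder for whatever coordination-level handling is chosen.

```typescript
// Hypothetical sketch: these names are illustrative, not OpenClaw APIs.
type FailureKind = "coordination" | "provider";
type Outcome = "next_fallback" | "chain_exhausted" | "abort_without_fallback";

// Error name and message text as they appear in the gateway log of this report.
const TAKEOVER_ERROR_NAME = "EmbeddedAttemptSessionTakeoverError";
const TAKEOVER_MESSAGE =
  "session file changed while embedded prompt lock was released";

// Decide whether a failure is local runtime coordination or a provider/model problem.
function classifyFailure(err: unknown): FailureKind {
  if (err instanceof Error) {
    const isTakeover =
      err.name === TAKEOVER_ERROR_NAME || err.message.includes(TAKEOVER_MESSAGE);
    if (isTakeover) {
      return "coordination";
    }
  }
  // Assumption: everything else (timeout, rate limit, auth, provider runtime
  // errors) stays eligible for model fallback.
  return "provider";
}

// Only provider failures advance the fallback chain; coordination failures
// stop it, because switching models cannot fix local session state.
function decideFallback(err: unknown, remainingCandidates: number): Outcome {
  if (classifyFailure(err) === "coordination") {
    return "abort_without_fallback";
  }
  return remainingCandidates > 0 ? "next_fallback" : "chain_exhausted";
}
```

With a check like this in front of the fallback decision, the session-file error would not be counted against each model candidate in turn.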
Actual behavior
The embedded run fails before reply with “All models failed”. The same session-file mutation is counted against multiple model candidates:
The user sees a generic model-chain failure even though the primary cause is local session state changing underneath the embedded run.
OpenClaw version
2026.5.16-beta.6
Operating system
macOS 26.5
Install method
npm global
Model
openai/gpt-5.5
Provider / routing chain
openclaw->openai/gpt-5.5
Additional provider/model setup details
No response
Logs, screenshots, and evidence
Impact and severity
Severity: High for interactive reliability.
Impact:
Suggested direction:
Additional information
Environment
Logs, screenshots, and evidence
Gateway log evidence from /tmp/openclaw/openclaw-2026-05-18.log showed repeated occurrences of this error pattern. A local grep found 61 occurrences of:
session file changed while embedded prompt lock was released
Concrete user-visible failure at 2026-05-18T14:09:40.850+07:00:
Embedded agent failed before reply: All models failed (3): openai/gpt-5.5: session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl (unknown) | claude-bridge/claude-opus-4-7: session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl (unknown) | google/gemini-3.1-pro-preview: LLM idle timeout (120s): no response from model (timeout) | LLM request timed out.
Nearby diagnostic/fallback evidence showed the same local session mutation being recorded as model fallback candidate failures:
lane task error: lane=main durationMs=36956 error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl"
model_fallback_decision: candidate_failed, candidate=openai/gpt-5.5, errorPreview="session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl", fallbackStepFinalOutcome="next_fallback"
model_fallback_decision: candidate_failed, candidate=claude-bridge/claude-opus-4-7, errorPreview="session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl", fallbackStepFinalOutcome="next_fallback"
model_fallback_decision: candidate_failed, candidate=google/gemini-3.1-pro-preview, reason="timeout", errorPreview="LLM idle timeout (120s): no response from model", fallbackStepFinalOutcome="chain_exhausted"
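To check a log excerpt for this pattern, the decision lines above can be grouped by `errorPreview`: an identical error reported by several unrelated candidates points to a local cause, not a provider one. This is a rough TypeScript sketch that assumes the lines keep the `key=value` / `key="quoted value"` shape shown above; `parseDecision` and `sharedErrors` are hypothetical helpers, not part of OpenClaw.

```typescript
// Hypothetical helpers for inspecting model_fallback_decision log lines.
interface FallbackDecision {
  candidate: string;
  reason: string | undefined;
  errorPreview: string;
  outcome: string;
}

// Parse one log line; returns null for lines that are not fallback decisions.
function parseDecision(line: string): FallbackDecision | null {
  if (!line.startsWith("model_fallback_decision:")) {
    return null;
  }
  const field = (key: string): string | undefined => {
    // Matches key="quoted value" or key=bare-value (up to a comma or space).
    const m = line.match(new RegExp(`${key}=(?:"([^"]*)"|([^,\\s]+))`));
    return m ? (m[1] ?? m[2]) : undefined;
  };
  const candidate = field("candidate");
  const errorPreview = field("errorPreview");
  const outcome = field("fallbackStepFinalOutcome");
  if (!candidate || errorPreview === undefined || !outcome) {
    return null;
  }
  return { candidate, reason: field("reason"), errorPreview, outcome };
}

// Map each error text seen on more than one candidate to those candidates.
function sharedErrors(lines: string[]): Map<string, string[]> {
  const byError = new Map<string, string[]>();
  for (const line of lines) {
    const decision = parseDecision(line);
    if (!decision) {
      continue;
    }
    const candidates = byError.get(decision.errorPreview) ?? [];
    candidates.push(decision.candidate);
    byError.set(decision.errorPreview, candidates);
  }
  const shared = new Map<string, string[]>();
  byError.forEach((candidates, error) => {
    if (candidates.length > 1) {
      shared.set(error, candidates);
    }
  });
  return shared;
}
```

Applied to the three decision lines above, the session-file message is the only error shared across candidates; the timeout belongs to a single candidate.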
Additional context from nearby logs:
Workaround
Roll back to 2026.5.16-beta.5