[Bug]: Embedded agent failover treats session-file mutation as model failure and exhausts all fallbacks #83510

@jsompis

Description

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

OpenClaw 2026.5.16-beta.6 can fail an embedded agent turn before replying when the active session JSONL changes while the embedded prompt lock is released. The failure is classified as a model/candidate failure, so the same local session-file mutation is retried across unrelated fallback models. This exhausts the fallback chain and surfaces as “All models failed”, even though the first two failures are not provider/model failures.

Steps to reproduce

  1. Use a long-lived embedded agent session with automatic context/maintenance activity enabled.
  2. Trigger an embedded agent turn on the active main session.
  3. While the embedded prompt lock is released, allow a local process such as context compaction, maintenance, memory sync, or another gateway write path to append/update the same session JSONL file.
  4. Observe model fallback behavior for the turn.

Expected behavior

A local session takeover/session-file mutation should be treated as a runtime coordination failure, not as a model failure.

The system should either:

  • abort the current turn with a clear non-model error and avoid consuming fallback models, or
  • restart/rebase the turn safely after re-reading the updated session state, if that is supported.

Fallback should be reserved for provider/model failures such as timeout, rate limit, auth, or provider runtime errors.
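The distinction above can be sketched as a small classifier. This is a minimal illustration, not OpenClaw's actual code: the `FailureKind` type and `classifyFailure` function are hypothetical names, and only the error name and message text are taken from the logs in this report.

```typescript
// Hypothetical failure categories; these names are illustrative only.
type FailureKind = "runtime_coordination" | "provider";

// Error name and message text as they appear in the logs quoted in this report.
const TAKEOVER_NAME = "EmbeddedAttemptSessionTakeoverError";
const TAKEOVER_MESSAGE = "session file changed while embedded prompt lock was released";

// Decide whether a failed attempt should be allowed to consume a fallback model.
function classifyFailure(err: unknown): FailureKind {
  if (err instanceof Error) {
    const byName = err.name === TAKEOVER_NAME;
    const byMessage = err.message.indexOf(TAKEOVER_MESSAGE) !== -1;
    if (byName || byMessage) {
      // Local session state changed; switching models cannot fix this.
      return "runtime_coordination";
    }
  }
  // Timeouts, rate limits, auth, and provider runtime errors stay on the fallback path.
  return "provider";
}
```

Matching on the message as well as the name is a defensive assumption, in case the error is re-wrapped before it reaches the fallback logic.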

Actual behavior

The embedded run fails before reply with “All models failed”. The same session-file mutation is counted against multiple model candidates:

  • openai/gpt-5.5 fails with: session file changed while embedded prompt lock was released
  • claude-bridge/claude-opus-4-7 fails with the same local session-file mutation
  • google/gemini-3.1-pro-preview then times out

The user sees a generic model-chain failure even though the primary cause is local session state changing underneath the embedded run.

OpenClaw version

2026.5.16-beta.6

Operating system

macOS 26.5

Install method

npm global

Model

openai/gpt-5.5

Provider / routing chain

openclaw->openai/gpt-5.5

Additional provider/model setup details

No response

Impact and severity

Severity: High for interactive reliability.

Impact:

  • User-facing turns fail before any assistant reply.
  • Fallback models are wasted on a non-model local coordination error.
  • The final error message misleads operators toward provider/model diagnosis instead of session lock/session mutation handling.
  • Repeated occurrences make long-lived or maintenance-heavy sessions unreliable.
  • This can mask the true cause when the last fallback happens to time out, because the final surfaced error is “All models failed” rather than a deterministic session takeover error.

Suggested direction:

  • Classify EmbeddedAttemptSessionTakeoverError/session-file mutation as a non-provider runtime coordination error.
  • Do not count this error against model fallback candidates.
  • Either fail fast with a clear session-concurrency message or retry only after safely rebuilding prompt/context from the updated session file.
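A minimal sketch of the fail-fast variant of this direction follows. It is not OpenClaw's fallback implementation: `runWithFallback`, `Attempt`, and `isSessionTakeover` are hypothetical names, and attempts are modeled as synchronous calls to keep the example short. Only the error name and the shape of the “All models failed” summary come from the logs in this report.

```typescript
// Error name as it appears in the gateway log quoted in this report.
const TAKEOVER_ERROR_NAME = "EmbeddedAttemptSessionTakeoverError";

// Hypothetical shape: one attempt against one model, returning the reply text.
type Attempt = (model: string) => string;

function isSessionTakeover(err: unknown): boolean {
  return err instanceof Error && err.name === TAKEOVER_ERROR_NAME;
}

// Try each model in order, but never spend a fallback on a local coordination error.
function runWithFallback(models: string[], attempt: Attempt): string {
  const failures: string[] = [];
  for (const model of models) {
    try {
      return attempt(model);
    } catch (err) {
      if (isSessionTakeover(err)) {
        // Fail fast: the session file changed locally, so the next model
        // would start from the same stale state.
        throw err;
      }
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${model}: ${message}`);
    }
  }
  throw new Error(`All models failed (${failures.length}): ${failures.join(" | ")}`);
}
```

With this shape, a takeover on the first candidate surfaces as a single session-concurrency error and the remaining candidates are left untouched. The retry-after-rebuild alternative would replace the rethrow with a bounded loop that re-reads the session file before trying the same model again.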

Additional information

Environment

  • Product: OpenClaw
  • Version observed in gateway/session UI: 2026.5.16-beta.6
  • Host: macOS 26.5
  • Node runtime in logs: node 24.15.0
  • Main session key: agent:main:main
  • Affected session file: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl
  • Models/fallback chain observed: openai/gpt-5.5 -> claude-bridge/claude-opus-4-7 -> google/gemini-3.1-pro-preview

Logs, screenshots, and evidence

Gateway log evidence from /tmp/openclaw/openclaw-2026-05-18.log showed repeated occurrences of this error pattern. A local grep found 61 occurrences of:

session file changed while embedded prompt lock was released

Concrete user-visible failure at 2026-05-18T14:09:40.850+07:00:

Embedded agent failed before reply: All models failed (3): openai/gpt-5.5: session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl (unknown) | claude-bridge/claude-opus-4-7: session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl (unknown) | google/gemini-3.1-pro-preview: LLM idle timeout (120s): no response from model (timeout) | LLM request timed out.

Nearby diagnostic/fallback evidence showed the same local session mutation being recorded as model fallback candidate failures:

lane task error: lane=main durationMs=36956 error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl"

model_fallback_decision: candidate_failed, candidate=openai/gpt-5.5, errorPreview="session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl", fallbackStepFinalOutcome="next_fallback"

model_fallback_decision: candidate_failed, candidate=claude-bridge/claude-opus-4-7, errorPreview="session file changed while embedded prompt lock was released: /Users/sompisjunsui/.openclaw/agents/main/sessions/67ebbe47-a99f-4eec-9524-728658b5f6a2.jsonl", fallbackStepFinalOutcome="next_fallback"

model_fallback_decision: candidate_failed, candidate=google/gemini-3.1-pro-preview, reason="timeout", errorPreview="LLM idle timeout (120s): no response from model", fallbackStepFinalOutcome="chain_exhausted"
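Until the classification is fixed, the aggregated error string can be split by hand to see which candidates failed for local reasons. The helper below is a triage sketch only: `parseChainFailure` and `CandidateFailure` are hypothetical names, and the `model: message | model: message` layout is inferred from the single failure line quoted above rather than from any documented format.

```typescript
interface CandidateFailure {
  model: string;
  message: string;
  local: boolean; // true when the failure is the session-file mutation, not the provider
}

const LOCAL_MARKER = "session file changed while embedded prompt lock was released";

// Split an "All models failed (N): ..." summary into per-candidate entries.
function parseChainFailure(summary: string): CandidateFailure[] {
  const start = summary.indexOf("): ");
  const body = start >= 0 ? summary.slice(start + 3) : summary;
  const result: CandidateFailure[] = [];
  for (const segment of body.split(" | ")) {
    const sep = segment.indexOf(": ");
    if (sep < 0) {
      continue; // trailing text without a "model: " prefix, such as a final summary sentence
    }
    const message = segment.slice(sep + 2);
    result.push({
      model: segment.slice(0, sep),
      message,
      local: message.indexOf(LOCAL_MARKER) !== -1,
    });
  }
  return result;
}
```

Applied to the failure line quoted above, this would mark the first two candidates as local and only the third as a provider timeout, which is the split the fallback logic currently does not make.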

Additional context from nearby logs:

  • lossless-claw/context maintenance was active around the same session.
  • The session file was below the auto-rotate size threshold, so this was not simply a large-file rotation case.
  • The same session path appears consistently across the failed candidates.

Workaround

Roll back to 2026.5.16-beta.5.

Metadata


Labels

  • P1 - High-priority user-facing bug, regression, or broken workflow.
  • bug - Something isn't working
  • clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix - ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:auth-provider - Auth, provider routing, model choice, or SecretRef resolution may break.
  • impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
  • regression - Behavior that previously worked and now fails
