Skip to content

EmbeddedAttemptSessionTakeoverError fires at ~120s on long Bedrock streams (fence whitelist too narrow?) #89259

@swiser-nexa

Description

@swiser-nexa

Summary

EmbeddedAttemptSessionTakeoverError ("session file changed while embedded prompt lock was released") fires consistently around the 120s mark on long Bedrock streaming runs, killing the turn even though no other agent or runner is touching the session. The fence's benign-rewrite whitelist appears too narrow for legitimate concurrent writes that happen during normal streaming + delivery flows.

Environment

  • OpenClaw: 2026.5.22 (npm install, host Linux 6.17 x64, node v24.14.0)
  • Bedrock provider: @openclaw/amazon-bedrock-provider 2026.5.22
  • Model: amazon-bedrock/zai.glm-5 via bedrock-converse-stream API
  • Channels: Slack DM (socket mode), GitHub webhook → /hooks/github-engineering, cron-nested isolated agentTurns
  • Lanes affected: main, cron-nested, session:agent:main:slack:direct:<user>:thread:<ts>

Reliable repro pattern

  1. Long Slack DM session with multiple exec + gh tool calls.
  2. After the last tool result, the model needs to compose a long final reply (≥ ~120s of streaming).
  3. At ~120s after the last toolResult, the runtime writes an empty provider:"amazon-bedrock" model:"zai.glm-5" assistant entry and throws EmbeddedAttemptSessionTakeoverError.

Failure timestamps observed (same session, two distinct user messages):

  • Run 1: last toolResult 21:44:02.xxx, empty assistant + throw at 21:46:02.573 (durationMs ~129017).
  • Run 2: last toolResult 22:02:48.xxx, empty assistant + throw at 22:04:48.944 (durationMs ~151455).

Same error type also fires from cron-nested lanes and from the github-engineering hook agentTurns running on completely different sessionFiles.

What the code path looks like

  • dist/pi-embedded-CsSFzly6.js:159enqueueCommandInLane(sessionLane, () => enqueueGlobal(...)) — the two simultaneous lane errors per failure are the same single failure unwinding nested lanes (sessionLane outer, globalLane inner), not two writers.
  • dist/selection-hR-AeOeU.js:7998TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS = new Set(["delivery-mirror","gateway-injected"]). Anything else flips takeoverDetected = true.
  • dist/selection-hR-AeOeU.js:8086/8097sessionFenceAdvanceIsBenign / sessionFenceRewriteIsBenign only allow lines whose model is in that whitelist.
  • dist/selection-hR-AeOeU.js:8210 — class EmbeddedAttemptSessionTakeoverError; thrown at :8324/:8387/:8440/:8530.

The failure stub written at the takeover moment has provider:"amazon-bedrock", model:"zai.glm-5" — i.e. the very record the runtime writes itself when the prompt lock is released — but on reacquire that line is treated as foreign.

Hypothesis

One of these (in order of likelihood) is writing during the prompt-lock-released window and tripping the fence:

  1. Streaming partial assistant deltas being persisted mid-prompt by the chat handler (dist/chat-zFy9Y_4Y.js:1351 fs.writeFileSync(params.transcriptPath, ...)).
  2. Delivery-mirror writers running on a different lane (dist/deliver-WPtVqUMT.js:1287, dist/run-delivery.runtime-B3LSluU0.js:366, dist/message-action-runner-B4oH5EYj.js:908) — but those use model:"delivery-mirror" which IS whitelisted, so probably not these.
  3. The runtime's own failure-stub writer racing with the fence reacquire.

Suggested fixes

  • Extend the benign-rewrite whitelist to recognise the runtime's own failure stubs (e.g. empty-content assistant lines whose runId matches the still-active attempt).
  • Or: scope the fence check to lines written after lock release with a different runId, not "any line that doesn't match the whitelist".
  • Or: surface a config-level escape hatch to relax the fence per agent (agents.<id>.session.fenceMode = "warn" | "strict").

Mitigation we applied locally

Not a fix, just headroom:

OPENCLAW_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS=120000  # was 60000
OPENCLAW_SESSION_WRITE_LOCK_MAX_HOLD_MS=600000          # was 300000

These help when the lock is the contention point. They don't address the fence whitelist itself.

Logs / artefacts available on request

  • Full session jsonl + trajectory (Slack DM session 0b213e58-...) showing both failures.
  • Gateway logs around 21:46:02 and 22:04:48 UTC.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High-priority user-facing bug, regression, or broken workflow.clawsweeper:needs-live-reproClawSweeper needs live local, crabbox, or manual validation to confirm this issue.clawsweeper:needs-maintainer-reviewClawSweeper marked this issue as needing maintainer review before automation.clawsweeper:needs-product-decisionClawSweeper marked this issue as needing a product or behavior decision.clawsweeper:no-new-fix-prClawSweeper does not recommend queueing a new automated fix PR for this issue.impact:message-lossChannel message delivery can be lost, duplicated, or misrouted.impact:session-stateSession, memory, transcript, context, or agent state can drift or corrupt.issue-rating: 🐚 platinum hermitGood issue quality with a plausible reproduction path needing some confirmation.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions