EmbeddedAttemptSessionTakeoverError fires at ~120s on long Bedrock streams (fence whitelist too narrow?) #89259

## Summary

`EmbeddedAttemptSessionTakeoverError` ("session file changed while embedded prompt lock was released") fires consistently around the 120s mark on long Bedrock streaming runs, killing the turn even though no other agent or runner is touching the session. The fence's benign-rewrite whitelist appears too narrow for legitimate concurrent writes that happen during normal streaming + delivery flows.

## Environment

- OpenClaw: `2026.5.22` (npm install, host Linux 6.17 x64, node v24.14.0)
- Bedrock provider: `@openclaw/amazon-bedrock-provider` 2026.5.22
- Model: `amazon-bedrock/zai.glm-5` via `bedrock-converse-stream` API
- Channels: Slack DM (socket mode), GitHub webhook → `/hooks/github-engineering`, cron-nested isolated agentTurns
- Lanes affected: `main`, `cron-nested`, `session:agent:main:slack:direct:<user>:thread:<ts>`

## Reliable repro pattern

1. Long Slack DM session with multiple `exec` + `gh` tool calls.
2. After the last tool result, the model needs to compose a long final reply (≥ ~120s of streaming).
3. At ~120s after the last toolResult, the runtime writes an empty `provider:"amazon-bedrock" model:"zai.glm-5"` assistant entry and throws `EmbeddedAttemptSessionTakeoverError`.

Failure timestamps observed (same session, two distinct user messages):
- Run 1: last toolResult `21:44:02.xxx`, empty assistant + throw at `21:46:02.573` (`durationMs ~129017`).
- Run 2: last toolResult `22:02:48.xxx`, empty assistant + throw at `22:04:48.944` (`durationMs ~151455`).

Same error type also fires from cron-nested lanes and from the github-engineering hook agentTurns running on completely different sessionFiles.
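
For reference, the gap between the last toolResult and the throw can be bounded from the timestamps above. This is only arithmetic on the reported values: the toolResult milliseconds are elided (`xxx`) in the report, so the sketch brackets them with `.000` and `.999` rather than guessing. The helper names `toMs` and `gapBoundsSeconds` are made up for this example.

```javascript
// Parse "HH:MM:SS.mmm" (same UTC day assumed) into milliseconds since midnight.
function toMs(hms) {
  const [h, m, s] = hms.split(":");
  return (Number(h) * 3600 + Number(m) * 60) * 1000 + Math.round(Number(s) * 1000);
}

// The toolResult milliseconds are unknown ("xxx"), so return a lower and an
// upper bound for the gap instead of a single value.
function gapBoundsSeconds(toolResultHms, throwHms) {
  const end = toMs(throwHms);
  return {
    min: (end - toMs(`${toolResultHms}.999`)) / 1000,
    max: (end - toMs(`${toolResultHms}.000`)) / 1000,
  };
}

const run1 = gapBoundsSeconds("21:44:02", "21:46:02.573");
const run2 = gapBoundsSeconds("22:02:48", "22:04:48.944");
console.log({ run1, run2 });
```

Both runs land within about one second of 120s, which is why this looks like a fixed timer or window rather than random contention.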

## What the code path looks like

- `dist/pi-embedded-CsSFzly6.js:159` — `enqueueCommandInLane(sessionLane, () => enqueueGlobal(...))` — the two simultaneous lane errors per failure are the same single failure unwinding nested lanes (sessionLane outer, globalLane inner), not two writers.
- `dist/selection-hR-AeOeU.js:7998` — `TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS = new Set(["delivery-mirror","gateway-injected"])`. Anything else flips `takeoverDetected = true`.
- `dist/selection-hR-AeOeU.js:8086/8097` — `sessionFenceAdvanceIsBenign` / `sessionFenceRewriteIsBenign` only allow lines whose model is in that whitelist.
- `dist/selection-hR-AeOeU.js:8210` — class `EmbeddedAttemptSessionTakeoverError`; thrown at `:8324`/`:8387`/`:8440`/`:8530`.

The failure stub written at the takeover moment has `provider:"amazon-bedrock", model:"zai.glm-5"` — i.e. the very record the runtime writes itself when the prompt lock is released — but on reacquire that line is treated as foreign.
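
To make the suspected behaviour concrete, here is a minimal sketch of a whitelist-only check, reconstructed from reading the bundle and not copied from it. The `Set` contents are as quoted above; the function name `appendedLinesLookBenign` and the entry shape (top-level `role` and `model` fields on each JSONL line) are assumptions for illustration, and the real `sessionFenceAdvanceIsBenign` may well differ.

```javascript
// Whitelist as quoted from dist/selection-hR-AeOeU.js:7998 in this report.
const TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS = new Set([
  "delivery-mirror",
  "gateway-injected",
]);

// Hypothetical stand-in for the benign check: every line appended while the
// prompt lock was released must be an assistant entry with a whitelisted model.
// The { role, model } entry shape is an assumption, not the confirmed schema.
function appendedLinesLookBenign(appendedJsonlLines) {
  return appendedJsonlLines.every((line) => {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      return false; // unparseable lines are treated as foreign
    }
    return (
      entry !== null &&
      entry.role === "assistant" &&
      TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS.has(entry.model)
    );
  });
}

// A delivery-mirror line passes; the runtime's own failure stub does not.
const mirrorLine = JSON.stringify({ role: "assistant", model: "delivery-mirror" });
const stubLine = JSON.stringify({
  role: "assistant",
  provider: "amazon-bedrock",
  model: "zai.glm-5",
  content: [],
});
console.log(appendedLinesLookBenign([mirrorLine])); // true
console.log(appendedLinesLookBenign([mirrorLine, stubLine])); // false
```

Under this reading, any line carrying the real provider model fails the check regardless of who wrote it, which matches what we observe.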

## Hypothesis

One of these (in order of likelihood) is writing during the prompt-lock-released window and tripping the fence:

1. Streaming partial assistant deltas being persisted mid-prompt by the chat handler (`dist/chat-zFy9Y_4Y.js:1351 fs.writeFileSync(params.transcriptPath, ...)`).
2. Delivery-mirror writers running on a different lane (`dist/deliver-WPtVqUMT.js:1287`, `dist/run-delivery.runtime-B3LSluU0.js:366`, `dist/message-action-runner-B4oH5EYj.js:908`). These write `model:"delivery-mirror"`, which is whitelisted, so they are unlikely to be the cause.
3. The runtime's own failure-stub writer racing with the fence reacquire.
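
One way to tell these candidates apart would be to snapshot the session file when the prompt lock is released and, on reacquire, summarise who wrote the appended lines. The sketch below is a hypothetical diagnostic, not existing OpenClaw code: `summarizeAppendedWriters` is an invented name, and it assumes each JSONL entry carries top-level `provider` and `model` fields. It takes the two file contents as strings, so the caller decides how to read them (for example with `fs.readFileSync(path, "utf8")`).

```javascript
// Compare the session file text at lock release (beforeText) with the text at
// reacquire (afterText) and count appended lines per "provider/model" writer.
function summarizeAppendedWriters(beforeText, afterText) {
  // If the old content is no longer a prefix, the file was rewritten in place
  // rather than appended to, which is a different failure mode.
  if (!afterText.startsWith(beforeText)) {
    return { rewritten: true, writers: {} };
  }
  const writers = {};
  const tail = afterText.slice(beforeText.length);
  for (const line of tail.split("\n")) {
    if (line.trim() === "") continue;
    let key = "unparseable";
    try {
      const entry = JSON.parse(line);
      // Assumed fields; "?" marks a field that is absent on the entry.
      key = `${entry.provider ?? "?"}/${entry.model ?? "?"}`;
    } catch {
      // keep key as "unparseable"
    }
    writers[key] = (writers[key] ?? 0) + 1;
  }
  return { rewritten: false, writers };
}
```

If the summary only ever shows `amazon-bedrock/zai.glm-5` entries, that would point at items 1 or 3 and away from the delivery-mirror writers.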

## Suggested fixes

- Extend the benign-rewrite whitelist to recognise the runtime's own failure stubs (e.g. empty-content assistant lines whose `runId` matches the still-active attempt).
- Or: scope the fence check to lines written *after* lock release with a different `runId`, not "any line that doesn't match the whitelist".
- Or: surface a config-level escape hatch to relax the fence per agent (`agents.<id>.session.fenceMode = "warn" | "strict"`).
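
A rough sketch of how the second and third suggestions could combine, to show the intended semantics rather than a patch against the bundle. Everything here is hypothetical: `findForeignLines` and `checkFence` are invented names, the sketch assumes session entries carry a top-level `runId` (not verified), and it throws a plain `Error` with the message quoted in the summary instead of the real `EmbeddedAttemptSessionTakeoverError` class.

```javascript
const BENIGN_MODELS = new Set(["delivery-mirror", "gateway-injected"]);

// A line is foreign only if it is neither whitelisted nor written by the
// still-active attempt (matching runId). Assumes entries expose `runId`.
function findForeignLines(appendedJsonlLines, activeRunId) {
  const foreign = [];
  for (const line of appendedJsonlLines) {
    let entry = null;
    try {
      entry = JSON.parse(line);
    } catch {
      // leave entry as null so the line is treated as foreign
    }
    if (entry === null || typeof entry !== "object") {
      foreign.push(line);
      continue;
    }
    const ownWrite = entry.runId !== undefined && entry.runId === activeRunId;
    if (!ownWrite && !BENIGN_MODELS.has(entry.model)) foreign.push(line);
  }
  return foreign;
}

// fenceMode mirrors the proposed `agents.<id>.session.fenceMode` setting:
// "strict" throws on foreign lines, "warn" reports them and lets the turn continue.
function checkFence(appendedJsonlLines, activeRunId, fenceMode = "strict") {
  const foreign = findForeignLines(appendedJsonlLines, activeRunId);
  if (foreign.length === 0) return { ok: true, warnings: [] };
  if (fenceMode === "warn") {
    return { ok: true, warnings: foreign.map((l) => `foreign session line: ${l}`) };
  }
  throw new Error("session file changed while embedded prompt lock was released");
}
```

With this shape, the runtime's own stub (same `runId`) no longer trips the fence, while a line from a different run still does in `strict` mode.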

## Mitigation we applied locally

Not a fix, just headroom:
```
OPENCLAW_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS=120000  # was 60000
OPENCLAW_SESSION_WRITE_LOCK_MAX_HOLD_MS=600000          # was 300000
```

These help when the lock is the contention point. They don't address the fence whitelist itself.

## Logs / artefacts available on request

- Full session jsonl + trajectory (Slack DM session 0b213e58-...) showing both failures.
- Gateway logs around 21:46:02 and 22:04:48 UTC.

