EmbeddedAttemptSessionTakeoverError fires on legitimate co-tenant writes to shared sessions (regression in 2026.5.17) #84071

@eekfonky

Summary

The fingerprint-based session-takeover fence introduced in 2026.5.17 (Agents/sessions: ... release the embedded run's coarse transcript lock before model I/O while locking persistence and cleanup separately. Fixes #13744) treats any write to the session jsonl during the releaseForPrompt() window as an adversarial takeover. That includes writes from legitimate co-tenants on the same session (heartbeat, cron, channel ingress) that go through the installSessionEventWriteLock / installSessionExternalHookWriteLock hooks.

Once tripped, hasSessionTakeover() is sticky and every subsequent withSessionWriteLock call throws. The diagnostic surfaces as a stalled session with recovery=none; the user-facing TUI shows "gateway disconnected: closed | idle" because the WS lane stalls at model_call:started and never streams.
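
The failure mode described above can be illustrated with a minimal in-memory sketch. It is a hypothetical model, not the OpenClaw implementation: the FenceModel and TakeoverError classes are invented for illustration, the whole file content stands in for the real fingerprint (whose composition is not known from this report), and only the method names releaseForPrompt, hasSessionTakeover and withSessionWriteLock mirror identifiers mentioned in this issue.

```typescript
// Hypothetical in-memory model of the reported behavior; NOT OpenClaw code.
class TakeoverError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TakeoverError";
  }
}

class FenceModel {
  private readFile: () => string;
  private fingerprint: string | null = null;
  private takeover = false;

  constructor(readFile: () => string) {
    this.readFile = readFile;
  }

  // Models releasing the coarse lock before model I/O: remember what the file looked like.
  // Assumption: the full content is used as the fingerprint to keep the sketch simple.
  releaseForPrompt(): void {
    this.fingerprint = this.readFile();
  }

  hasSessionTakeover(): boolean {
    return this.takeover;
  }

  // Every guarded write checks the fence first; a mismatch sets a flag that is never cleared.
  withSessionWriteLock<T>(fn: () => T): T {
    if (this.fingerprint !== null && this.readFile() !== this.fingerprint) {
      this.takeover = true;
    }
    if (this.takeover) {
      throw new TakeoverError("session file changed while embedded prompt lock was released");
    }
    return fn();
  }
}

// Simulated session file shared by the embedded run and a co-tenant.
let sessionFile = "user turn\n";
const fence = new FenceModel(() => sessionFile);

fence.releaseForPrompt();      // model I/O window opens
sessionFile += "heartbeat\n";  // legitimate co-tenant write lands inside the window

const outcomes: string[] = [];
for (let attempt = 0; attempt < 2; attempt++) {
  try {
    fence.withSessionWriteLock(() => {
      sessionFile += "assistant reply\n";
    });
    outcomes.push("ok");
  } catch (e) {
    outcomes.push((e as Error).name === "TakeoverError" ? "takeover" : "other");
  }
}
console.log(outcomes, fence.hasSessionTakeover());
```

In this model the first guarded write trips the fence and the retry fails as well, because nothing ever clears the flag. That matches the stuck-controller behavior reported here, under the stated assumptions.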

Environment

  • OpenClaw 2026.5.18 (50a2481), Node 24.15.0, Linux LXC (Proxmox)
  • Gateway: local, loopback only, single-user
  • Default agent main, heartbeat 30m (default), kimi-k2.6:cloud primary via ollama-iron provider (~100s typical model call)
  • Shared session agent:main:main is also used by 8+ cron jobs and a Discord channel

Reproduction

  1. Configure the default agent with heartbeat 30m (default).
  2. Run any embedded turn through a slow provider (Ollama cloud, ~100s).
  3. Within ~30 minutes, heartbeat (or any other co-tenant) writes to the same session via the registered write-lock hooks while the model I/O window is open.
  4. The next withSessionWriteLock throws EmbeddedAttemptSessionTakeoverError; model_call stalls; subsequent retries on the same controller also throw.

Observed

Journal:

[diagnostic] lane task error: lane=main durationMs=116088
  error="EmbeddedAttemptSessionTakeoverError: session file changed while
  embedded prompt lock was released: ...sessions/<sid>.jsonl"
[diagnostic] lane task error: lane=session:agent:main:main durationMs=116091 ...
[model-fallback/decision] decision=candidate_failed ... reason=unknown
  detail=session file changed while embedded prompt lock was released
[diagnostic] stalled session: ... activeWorkKind=model_call
  lastProgress=model_call:started lastProgressAge=150s recovery=none

Both lane=main and lane=session:agent:main:main error at the same instant on the same session file with near-identical durationMs (off by 2–3 ms across multiple occurrences), which points to a within-process race rather than an external-process modification. Reproduced 4× in 2 hours on agent:main:main; the cadence matches the 30-minute heartbeat.
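
For anyone checking their own journal for the same signature, the pairing described above can be sketched as a small helper. This is an illustrative script, not part of OpenClaw: the regular expression is an assumption based only on the two log lines quoted in this report, and parseLaneErrors, pairedErrors and the 5 ms tolerance are hypothetical.

```typescript
// Hypothetical helper: pair "lane task error" lines from different lanes whose
// durationMs values are within a small tolerance of each other.
interface LaneError {
  lane: string;
  durationMs: number;
}

interface ErrorPair {
  a: LaneError;
  b: LaneError;
  deltaMs: number;
}

// Assumption: the line shape matches the excerpt quoted in this issue.
function parseLaneErrors(journal: string): LaneError[] {
  const pattern = /lane task error: lane=(\S+) durationMs=(\d+)/g;
  const found: LaneError[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(journal)) !== null) {
    found.push({ lane: String(match[1]), durationMs: Number(match[2]) });
  }
  return found;
}

function pairedErrors(errors: LaneError[], toleranceMs: number): ErrorPair[] {
  const pairs: ErrorPair[] = [];
  errors.forEach((a, i) => {
    errors.slice(i + 1).forEach((b) => {
      const deltaMs = Math.abs(a.durationMs - b.durationMs);
      if (a.lane !== b.lane && deltaMs <= toleranceMs) {
        pairs.push({ a, b, deltaMs });
      }
    });
  });
  return pairs;
}

// The two lines quoted in the Observed section, including the elided tail.
const journal = [
  "[diagnostic] lane task error: lane=main durationMs=116088",
  "[diagnostic] lane task error: lane=session:agent:main:main durationMs=116091 ...",
].join("\n");

const pairs = pairedErrors(parseLaneErrors(journal), 5);
console.log(pairs.map((p) => `${p.a.lane} <-> ${p.b.lane}: ${p.deltaMs} ms`));
```

The excerpt carries no timestamps, so the helper compares durations only. On a full journal it would make sense to restrict the comparison to entries logged close together in time as well.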

Expected

The fence should distinguish writes by registered co-tenants (which already synchronize via installSessionEventWriteLock / installSessionExternalHookWriteLock) from external/uncoordinated mutators. A coordinated write should either (a) participate in the fingerprint by refreshing it under the write lock, or (b) not trip the fence at all.

Alternatively, provide a recovery path so the controller can re-fingerprint and resume after a legitimate concurrent write, rather than becoming permanently stuck on recovery=none.
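
One possible shape for option (a) plus the recovery path is sketched below. It is hypothetical and uses an in-memory string in place of the session jsonl. CoordinatedFence, coordinatedAppend, externalAppend and recover are invented names, and the synchronous method body stands in for holding the real write lock.

```typescript
// Hypothetical sketch of option (a) and a recovery path; NOT the OpenClaw API.
class CoordinatedFence {
  private file: string;
  private fingerprint: string | null = null;
  private tripped = false;

  constructor(initial: string) {
    this.file = initial;
  }

  // Model I/O window opens: remember the current state of the file.
  releaseForPrompt(): void {
    this.fingerprint = this.file;
  }

  // Registered co-tenant path: write and fingerprint refresh happen together,
  // so the write is not later mistaken for a takeover.
  coordinatedAppend(entry: string): void {
    this.assertFence();
    this.file += entry;
    if (this.fingerprint !== null) {
      this.fingerprint = this.file;
    }
  }

  // Uncoordinated mutator that bypasses the hooks; this should still trip the fence.
  externalAppend(entry: string): void {
    this.file += entry;
  }

  assertFence(): void {
    if (this.fingerprint !== null && this.fingerprint !== this.file) {
      this.tripped = true;
    }
    if (this.tripped) {
      throw new Error("session file changed while embedded prompt lock was released");
    }
  }

  // Recovery path: accept the current file state and clear the sticky flag.
  recover(): void {
    this.tripped = false;
    this.fingerprint = this.file;
  }

  contents(): string {
    return this.file;
  }
}

const coordinated = new CoordinatedFence("user turn\n");
coordinated.releaseForPrompt();

coordinated.coordinatedAppend("heartbeat\n"); // registered co-tenant write
let coordinatedOk = true;
try { coordinated.assertFence(); } catch { coordinatedOk = false; }

coordinated.externalAppend("rogue\n"); // write that bypasses the hooks
let externalTripped = false;
try { coordinated.assertFence(); } catch { externalTripped = true; }

coordinated.recover(); // caller has decided the change is acceptable
let recovered = true;
try { coordinated.assertFence(); } catch { recovered = false; }

console.log({ coordinatedOk, externalTripped, recovered });
```

Whether recovery is safe after a truly external write is a policy decision for the maintainers. The sketch only shows that a re-fingerprint step would let the controller resume instead of remaining at recovery=none.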

Code references (2026.5.18 bundle)

  • dist/plugin-sdk/src/agents/pi-embedded-runner/run/attempt.session-lock.d.ts
  • dist/selection-Cr-9-UpD.js lines ~7827 (error class), ~7884 (createEmbeddedAttemptSessionLockController), ~7911 (assertSessionFileFence), ~7919 (refreshSessionFileFence)
  • The tunables session.writeLock.{acquireTimeoutMs, staleMs, maxHoldMs} (and the corresponding OPENCLAW_SESSION_WRITE_LOCK_* env vars) do not affect the fence, because it is fingerprint-based rather than timeout-based.

Workarounds

  • Restart the gateway to clear the stalled lane (this resets controller state, but only helps until the next co-tenant write).
  • Pin TUI / critical turns to a dedicated agent / session not shared with heartbeat / cron / channels.
  • Roll back to 2026.5.16 (last build before the fence was added).

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
