EmbeddedAttemptSessionTakeoverError fires on legitimate co-tenant writes to shared sessions (regression in 2026.5.17) #84071
Closed as not planned
Labels
P1 - High-priority user-facing bug, regression, or broken workflow.
clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
clawsweeper:queueable-fix - ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
impact:message-loss - Channel message delivery can be lost, duplicated, or misrouted.
impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
Summary
The new fingerprint-based session-takeover fence introduced in 2026.5.17 (Agents/sessions: ... release the embedded run's coarse transcript lock before model I/O while locking persistence and cleanup separately. Fixes #13744) treats any write to the session jsonl during the releaseForPrompt() window as an adversarial takeover. That includes writes from legitimate co-tenants on the same session (heartbeat, cron, channel ingress) that go through the installSessionEventWriteLock / installSessionExternalHookWriteLock hooks.
Once tripped, hasSessionTakeover() is sticky and every subsequent withSessionWriteLock call throws. The diagnostic surfaces as a stalled session with recovery=none; the user-facing TUI shows "gateway disconnected: closed | idle" because the WS lane stalls at model_call:started and never streams.
Environment
main, heartbeat 30m (default), kimi-k2.6:cloud primary via ollama-iron provider (~100s typical model call)
agent:main:main is also used by 8+ cron jobs and a Discord channel
Reproduction
Heartbeat 30m (default).
withSessionWriteLock throws EmbeddedAttemptSessionTakeoverError; model_call stalls; subsequent retries on the same controller also throw.
Observed
Journal:
Both lane=main and lane=session:agent:main:main error at the same instant on the same session file with near-identical durationMs (off by 2–3 ms across multiple occurrences), confirming a within-process race rather than an external-process modification. Reproduced 4× in 2 hours on agent:main:main; the cadence matches the heartbeat (30 min).
Expected
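One way to read the two requests in this section (a coordinated write that refreshes the fingerprint under the write lock, and a recovery path that re-fingerprints) is the in-memory sketch below. SessionFile, FenceController, writeUnderLock, and recover are hypothetical names chosen for illustration, the length-based fingerprint is an assumption, and a synchronous method call stands in for holding the write lock. None of this is the OpenClaw implementation.

```typescript
// Hypothetical in-memory model for illustration; not the OpenClaw implementation.
class SessionFile {
  private lines: string[] = [];

  // Direct append: models an uncoordinated mutator that bypasses the fence.
  append(line: string): void {
    this.lines.push(line);
  }

  // Assumption: the fingerprint is the line count plus the total text length.
  fingerprint(): string {
    const totalLength = this.lines.reduce((sum, l) => sum + l.length, 0);
    return `${this.lines.length}:${totalLength}`;
  }
}

class FenceController {
  private expected: string;
  private takenOver = false;

  constructor(private readonly file: SessionFile) {
    this.expected = file.fingerprint();
  }

  hasTakeover(): boolean {
    return this.takenOver;
  }

  // Coordinated write: check the fence, append, then refresh the fingerprint
  // in the same critical section (option (a) in this section).
  writeUnderLock(line: string): void {
    if (this.takenOver || this.file.fingerprint() !== this.expected) {
      this.takenOver = true; // stays set until recover() is called
      throw new Error("session takeover detected");
    }
    this.file.append(line);
    this.expected = this.file.fingerprint();
  }

  // Recovery path: accept the current file state as the new baseline.
  recover(): void {
    this.takenOver = false;
    this.expected = this.file.fingerprint();
  }
}

const sessionFile = new SessionFile();
const controller = new FenceController(sessionFile);
controller.writeUnderLock("heartbeat tick");  // registered co-tenant
controller.writeUnderLock("assistant reply"); // run owner, still passes
console.log(controller.hasTakeover()); // false
```

In this model only a direct sessionFile.append(...) call trips the fence, and recover() lets the controller resume instead of staying stuck. Whether recovery should be automatic or operator-triggered is left open here.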
The fence should distinguish writes by registered co-tenants (which already synchronize via installSessionEventWriteLock / installSessionExternalHookWriteLock) from external/uncoordinated mutators. A coordinated write should either (a) participate in the fingerprint by refreshing it under the write lock, or (b) not trip the fence at all.
Alternatively, provide a recovery path so the controller can re-fingerprint and resume after a legitimate concurrent write, rather than becoming permanently stuck on recovery=none.
Code references (2026.5.18 bundle)
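To illustrate why lock timeout settings would not influence a fingerprint-based fence, here is a small sketch contrasting the two kinds of check. FileSnapshot, fingerprintOf, lockIsStale, and fenceHolds are hypothetical names, and the size-plus-mtime fingerprint is an assumption: the issue does not state what the real fingerprint is computed from.

```typescript
// Hypothetical shapes for illustration; not taken from the OpenClaw bundle.
interface FileSnapshot {
  size: number;    // file size in bytes
  mtimeMs: number; // last-modified time in milliseconds
}

// Assumption: the fingerprint combines size and modification time.
function fingerprintOf(snapshot: FileSnapshot): string {
  return `${snapshot.size}:${snapshot.mtimeMs}`;
}

// Timeout-style check: depends only on how long the lock has been held.
function lockIsStale(heldMs: number, staleMs: number): boolean {
  return heldMs > staleMs;
}

// Fingerprint-style check: depends only on whether the file changed.
function fenceHolds(expectedFingerprint: string, current: FileSnapshot): boolean {
  return expectedFingerprint === fingerprintOf(current);
}

const baseline: FileSnapshot = { size: 1024, mtimeMs: 1000 };
const baselineFingerprint = fingerprintOf(baseline);

// No intervening write: the fence holds however long the model call takes.
const holdsWhenUnchanged = fenceHolds(baselineFingerprint, baseline);

// A co-tenant append 5 ms into the window: the lock is nowhere near stale,
// yet the fence already fails because the file changed.
const afterAppend: FileSnapshot = { size: 1100, mtimeMs: 1005 };
const staleAfter5Ms = lockIsStale(5, 30000);
const holdsAfterAppend = fenceHolds(baselineFingerprint, afterAppend);
```

Elapsed time is not an input to fenceHolds, so in this model no timeout value changes its result.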
dist/plugin-sdk/src/agents/pi-embedded-runner/run/attempt.session-lock.d.ts
dist/selection-Cr-9-UpD.js lines ~7827 (error class), ~7884 (createEmbeddedAttemptSessionLockController), ~7911 (assertSessionFileFence), ~7919 (refreshSessionFileFence)
session.writeLock.{acquireTimeoutMs, staleMs, maxHoldMs} (and corresponding OPENCLAW_SESSION_WRITE_LOCK_* env vars) do not affect the fence; it is fingerprint-based, not timeout-based.
Workarounds