Bug Description
Cron jobs using idealab/claude-opus-4-6 as the model consistently fail with EmbeddedAttemptSessionTakeoverError when the session file fingerprint (dev/ino/size/mtimeNs/ctimeNs) changes between releaseForPrompt() and the subsequent assertSessionFileFence() check.
The modification is self-inflicted — the gateway's own internal async process (likely memory-core plugin indexing, model-snapshot write, or trajectory sync) modifies the .jsonl session file during the lock-free window while waiting for the model response.
Reproduction
- Version: 2026.5.20
- Trigger: Any isolated cron job with
idealab/claude-opus-4-6 that has a ~20s+ model response time
- Frequency: 100% reproducible once timing corridor is hit (7/7 consecutive failures for the same job)
- Workaround: Switching to a different provider (e.g.
dashscope/deepseek-v4-pro or even idealab/gpt-5.4) avoids the issue — suggesting the race is provider-specific in the streaming/auth initialization path
Evidence
-
The error appears in logs since 5/19, hitting multiple job types intermittently:
memory-capture-fallback (ops agent)
daily-ops-review (ops agent)
dreaming-narrative (main agent)
- Dashboard sessions (main agent)
-
Same agent + same model + different prompt size = different outcome:
- Short prompt job (capture-fallback, ~4s model response) → succeeds
- Long prompt job (daily-ops-review, ~24s model response) → fails every time
-
Gateway restart does NOT fix it (confirmed: restarted, immediately failed again)
-
Switching model to non-Opus provider → session lock error disappears (fails on tool error instead, but the lock race is gone)
Root Cause Analysis
Source: dist/selection-BmjEdnnA.js lines 7945-8050
// releaseForPrompt() records fingerprint then releases lock
async releaseForPrompt() {
fenceFingerprint = await readSessionFileFingerprint(sessionFile);
fenceActive = true;
await lock.release();
}
// assertSessionFileFence() checks fingerprint hasn't changed
async function assertSessionFileFence() {
const current = await readSessionFileFingerprint(sessionFile);
if (!sameSessionFileFingerprint(fenceFingerprint, current)) {
// Only exception: growth is pure assistant transcript entries
if (await changeLooksLikeOwnedPromptOutput({...})) {
fenceFingerprint = current; return;
}
throw new EmbeddedAttemptSessionTakeoverError(sessionFile);
}
}
// Fingerprint uses nanosecond-precision mtime
function sameSessionFileFingerprint(left, right) {
return left.dev === right.dev && left.ino === right.ino
&& left.size === right.size
&& left.mtimeNs === right.mtimeNs
&& left.ctimeNs === right.ctimeNs;
}
The fingerprint comparison is correct in principle but the invariant assumption ("no internal process will write to the session file during the lock-free window") is violated by the gateway's own async pipeline.
Suggested Fixes
- Drain all pending session writes before recording fingerprint — ensure no async internal write is in-flight when
releaseForPrompt() snapshots the fingerprint
- Relax fingerprint to size-only — if the file grew by known-good internal entries (not just assistant output), allow it
- Add grace period — re-read fingerprint after a short delay if mismatch detected, to handle writes that were "in the pipeline" at snapshot time
- Provider-aware lock timing — if certain providers trigger additional session writes during auth/streaming setup, account for them in the lock lifecycle
Environment
- macOS 14.6 (arm64)
- Node v22.19.0
- OpenClaw 2026.5.20
- Plugins: browser, memory-core, searxng, skill-trigger-engine
- 4 configured agents (main, ops, scout, editor)
Bug Description
Cron jobs using
idealab/claude-opus-4-6as the model consistently fail withEmbeddedAttemptSessionTakeoverErrorwhen the session file fingerprint (dev/ino/size/mtimeNs/ctimeNs) changes betweenreleaseForPrompt()and the subsequentassertSessionFileFence()check.The modification is self-inflicted — the gateway's own internal async process (likely memory-core plugin indexing, model-snapshot write, or trajectory sync) modifies the
.jsonlsession file during the lock-free window while waiting for the model response.Reproduction
idealab/claude-opus-4-6that has a ~20s+ model response timedashscope/deepseek-v4-proor evenidealab/gpt-5.4) avoids the issue — suggesting the race is provider-specific in the streaming/auth initialization pathEvidence
The error appears in logs since 5/19, hitting multiple job types intermittently:
memory-capture-fallback(ops agent)daily-ops-review(ops agent)dreaming-narrative(main agent)Same agent + same model + different prompt size = different outcome:
Gateway restart does NOT fix it (confirmed: restarted, immediately failed again)
Switching model to non-Opus provider → session lock error disappears (fails on tool error instead, but the lock race is gone)
Root Cause Analysis
Source:
dist/selection-BmjEdnnA.jslines 7945-8050The fingerprint comparison is correct in principle but the invariant assumption ("no internal process will write to the session file during the lock-free window") is violated by the gateway's own async pipeline.
Suggested Fixes
releaseForPrompt()snapshots the fingerprintEnvironment