Skip to content

EmbeddedAttemptSessionTakeoverError: self-inflicted session file modification during lock-free window (race condition) #86804

@lovensky1992-wk

Description

@lovensky1992-wk

Bug Description

Cron jobs using idealab/claude-opus-4-6 as the model consistently fail with EmbeddedAttemptSessionTakeoverError when the session file fingerprint (dev/ino/size/mtimeNs/ctimeNs) changes between releaseForPrompt() and the subsequent assertSessionFileFence() check.

The modification is self-inflicted — the gateway's own internal async process (likely memory-core plugin indexing, model-snapshot write, or trajectory sync) modifies the .jsonl session file during the lock-free window while waiting for the model response.

Reproduction

  • Version: 2026.5.20
  • Trigger: Any isolated cron job with idealab/claude-opus-4-6 that has a ~20s+ model response time
  • Frequency: 100% reproducible once timing corridor is hit (7/7 consecutive failures for the same job)
  • Workaround: Switching to a different provider (e.g. dashscope/deepseek-v4-pro or even idealab/gpt-5.4) avoids the issue — suggesting the race is provider-specific in the streaming/auth initialization path

Evidence

  1. The error appears in logs since 5/19, hitting multiple job types intermittently:

    • memory-capture-fallback (ops agent)
    • daily-ops-review (ops agent)
    • dreaming-narrative (main agent)
    • Dashboard sessions (main agent)
  2. Same agent + same model + different prompt size = different outcome:

    • Short prompt job (capture-fallback, ~4s model response) → succeeds
    • Long prompt job (daily-ops-review, ~24s model response) → fails every time
  3. Gateway restart does NOT fix it (confirmed: restarted, immediately failed again)

  4. Switching model to non-Opus provider → session lock error disappears (fails on tool error instead, but the lock race is gone)

Root Cause Analysis

Source: dist/selection-BmjEdnnA.js lines 7945-8050

// releaseForPrompt() records fingerprint then releases lock
async releaseForPrompt() {
    fenceFingerprint = await readSessionFileFingerprint(sessionFile);
    fenceActive = true;
    await lock.release();
}

// assertSessionFileFence() checks fingerprint hasn't changed
async function assertSessionFileFence() {
    const current = await readSessionFileFingerprint(sessionFile);
    if (!sameSessionFileFingerprint(fenceFingerprint, current)) {
        // Only exception: growth is pure assistant transcript entries
        if (await changeLooksLikeOwnedPromptOutput({...})) {
            fenceFingerprint = current; return;
        }
        throw new EmbeddedAttemptSessionTakeoverError(sessionFile);
    }
}

// Fingerprint uses nanosecond-precision mtime
function sameSessionFileFingerprint(left, right) {
    return left.dev === right.dev && left.ino === right.ino
        && left.size === right.size
        && left.mtimeNs === right.mtimeNs
        && left.ctimeNs === right.ctimeNs;
}

The fingerprint comparison is correct in principle but the invariant assumption ("no internal process will write to the session file during the lock-free window") is violated by the gateway's own async pipeline.

Suggested Fixes

  1. Drain all pending session writes before recording fingerprint — ensure no async internal write is in-flight when releaseForPrompt() snapshots the fingerprint
  2. Relax fingerprint to size-only — if the file grew by known-good internal entries (not just assistant output), allow it
  3. Add grace period — re-read fingerprint after a short delay if mismatch detected, to handle writes that were "in the pipeline" at snapshot time
  4. Provider-aware lock timing — if certain providers trigger additional session writes during auth/streaming setup, account for them in the lock lifecycle

Environment

  • macOS 14.6 (arm64)
  • Node v22.19.0
  • OpenClaw 2026.5.20
  • Plugins: browser, memory-core, searxng, skill-trigger-engine
  • 4 configured agents (main, ops, scout, editor)

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Normal backlog priority with limited blast radius.impact:session-stateSession, memory, transcript, context, or agent state can drift or corrupt.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions