
EmbeddedAttemptSessionTakeoverError: concurrent lane tasks race on session .jsonl file #85633

@bob61x

Bug Report: EmbeddedAttemptSessionTakeoverError causes "Something went wrong" in Feishu DM channel

Environment

  • OpenClaw version: 2026.5.20
  • Runtime: Node.js 24.14.0
  • OS: Linux 5.19.17 (NAS, Intel N97, 15GB RAM)
  • Channel: Feishu (direct message)
  • Deployment: Docker (openclaw-gateway + openclaw-cli)

Description

When receiving messages via the Feishu DM channel, the agent occasionally crashes with EmbeddedAttemptSessionTakeoverError, which gets surfaced to the user as "Something went wrong while processing your request". This appears to be caused by concurrent lane tasks racing on the same session .jsonl file.

Reproduction

  1. Use Feishu DM channel (dmPolicy: open)
  2. Send a message to the agent
  3. Occasionally (not every time), the error occurs

The error seems more likely to happen after a /new command followed quickly by another message, but also occurs during normal usage.

Observed Behavior

Today (2026-05-23), the error occurred 3 times on the same session lane:

Time (UTC+8) | Session ID   | Error                               | Durations
06:54        | 4c74b327-... | EmbeddedAttemptSessionTakeoverError | lane=main (14707ms) + lane=session:... (14709ms)
10:32        | 0d1c5ef1-... | Same                                | lane=main (74111ms) + lane=session:... (74116ms)
11:37        | fb875090-... | Same                                | lane=main (14386ms) + lane=session:... (14389ms)

Key observations:

  • Both lane=main and lane=session:agent:main:feishu:direct:{user_id} fail together with nearly identical durations (2 to 5 ms apart).
  • The error is always: session file changed while embedded prompt lock was released: /home/node/.openclaw/agents/main/sessions/{session_id}.jsonl
  • All failures originate from the same Feishu DM lane (session=agent:main:feishu:direct:ou_4ee1d4e556e4bc4a2d1b3a084716a82d).

Root Cause Analysis (from source code inspection)

The error originates in /app/dist/selection-BmjEdnnA.js:

async function assertSessionFileFence() {
    if (!fenceActive) return;
    const current = await readSessionFileFingerprint(params.lockOptions.sessionFile);
    if (!sameSessionFileFingerprint(fenceFingerprint, current)) {
        if (current.exists && await changeLooksLikeOwnedPromptOutput(...)) {
            fenceFingerprint = current; return;  // safe harbor for assistant output
        }
        takeoverDetected = true;
        throw new EmbeddedAttemptSessionTakeoverError(params.lockOptions.sessionFile);
    }
}

The problem: The releaseForPrompt() mechanism releases the session write lock while the LLM streams its response, but installs a "fence" to detect whether the .jsonl file changes during that window. The changeLooksLikeOwnedPromptOutput() safe harbor lets only assistant transcript entries pass without throwing. Any other write during that window, for example from another concurrent lane or task, fails the fence check and raises EmbeddedAttemptSessionTakeoverError.
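To make the failure mode concrete, here is a minimal in-memory model of a fence with an assistant-only safe harbor. It is a sketch under stated assumptions, not OpenClaw's implementation: the session file is modelled as an array of JSON lines, the fingerprint is simply the line count, and makeFence, check and runDemo are hypothetical names.

```javascript
// Illustrative model only; not OpenClaw's implementation.
class TakeoverError extends Error {}

// The session .jsonl file is modelled as an in-memory array of JSON lines.
function makeFence(sessionLines) {
  // Hypothetical fingerprint: the number of lines present when the lock was released.
  let fencedLength = sessionLines.length;

  return function assertFence() {
    if (sessionLines.length === fencedLength) return; // nothing changed

    const appended = sessionLines
      .slice(fencedLength)
      .map((line) => JSON.parse(line));

    // Assumed safe-harbor rule: only appended assistant entries are tolerated.
    const ownOutput =
      appended.length > 0 && appended.every((entry) => entry.role === "assistant");
    if (ownOutput) {
      fencedLength = sessionLines.length; // advance the fence past our own output
      return;
    }
    throw new TakeoverError(
      "session file changed while embedded prompt lock was released"
    );
  };
}

// Runs the fence check and reports the outcome as a string.
function check(assertFence) {
  try {
    assertFence();
    return "ok";
  } catch (err) {
    if (err instanceof TakeoverError) return "takeover";
    throw err;
  }
}

function runDemo() {
  const session = [JSON.stringify({ role: "user", text: "hi" })];
  const assertFence = makeFence(session);
  const results = [];

  // The streaming turn appends its own assistant output: tolerated.
  session.push(JSON.stringify({ role: "assistant", text: "hello" }));
  results.push(check(assertFence));

  // A second lane appends a non-assistant entry while the lock is released.
  session.push(JSON.stringify({ role: "user", text: "second message" }));
  results.push(check(assertFence));

  return results;
}

console.log(runDemo()); // [ 'ok', 'takeover' ]
```

In this model the second write is indistinguishable from a foreign takeover, which matches the behavior described above: the fence cannot tell a legitimate sibling lane from an intruder.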

Evidence of concurrent lanes:

  • Every incident shows two lanes failing within the same logged second, with durations that differ by at most 5 ms.
  • This suggests the same dispatch spawns both lane=main and lane=session:..., and they race on the same session file.
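The gap between the two lanes can be checked directly from the durations reported in the table under Observed Behavior. The snippet below only redoes that arithmetic; the field names mainMs and sessionMs are illustrative.

```javascript
// Durations copied from the three incidents listed under "Observed Behavior".
const incidents = [
  { time: "06:54", mainMs: 14707, sessionMs: 14709 },
  { time: "10:32", mainMs: 74111, sessionMs: 74116 },
  { time: "11:37", mainMs: 14386, sessionMs: 14389 },
];

// Gap between the lane=session:... and lane=main durations for each incident.
const gaps = incidents.map((incident) => incident.sessionMs - incident.mainMs);
const maxGap = Math.max(...gaps);

console.log(gaps);   // [ 2, 5, 3 ]
console.log(maxGap); // 5
```

In all three incidents the lane=session:... task is reported as slightly longer than lane=main, by 2, 5 and 3 ms.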

Ruled Out

  • ❌ Docker permissions: fully verified (docker ps, docker info, docker exec all work)
  • ❌ Auto-compaction: compaction events occur at different timestamps (11:16, 11:51) than the error (11:37)
  • ❌ session-memory hook: only writes to the memory/ directory, not to the .jsonl file
  • ❌ Cron jobs: enabled: true, jobs: 0, no jobs running during failures
  • ❌ PT MCP server: does not interact with session files

Relevant Log Snippet (11:37 incident)

11:36:41 Feishu DM: /new
11:36:41 dispatching to agent (session=agent:main:feishu:direct:...)
11:36:42 dispatch complete (queuedFinal=true, replies=1)

11:37:33 Feishu DM: "Check whether your docker permissions are all complete"
11:37:33 dispatching to agent (session=agent:main:feishu:direct:...)
11:37:34 tool "_debug" from server "pt-mcp-server" registered...
11:37:47 lane task error: lane=main durationMs=14386 error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: ...fb875090-...jsonl"
11:37:47 lane task error: lane=session:... durationMs=14389 (same error)

Impact

  • User experience: intermittent "Something went wrong" errors
  • Frequency: 3 times in one day (2026-05-23) under normal usage
  • Session data is not corrupted, but the turn fails completely

Workaround

  • Avoid sending messages immediately after /new; wait 3-5 seconds for session initialization to complete.
  • Use /reset or /new periodically to keep sessions short (unconfirmed whether this reduces the error rate).

Suggested Fix

The session write lock + fence mechanism may need to:

  1. Ensure only one lane task can hold the embedded prompt lock for a given session at a time, OR
  2. Extend the changeLooksLikeOwnedPromptOutput() safe-harbor to account for concurrent lane tasks writing to the same file, OR
  3. Serialize the dispatch so that lane=main and lane=session:... do not run concurrently on the same session.
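As a rough illustration of option 3 (and, in effect, option 1), tasks could be chained per session key so that only one runs at a time. This is a sketch assuming a single Node.js process; runSerialized, tails and demo are hypothetical names, not OpenClaw APIs, and the session key in the demo is a placeholder.

```javascript
// Hypothetical helper; names are illustrative and not part of OpenClaw.
const tails = new Map(); // sessionKey -> promise that settles when the last queued task is done

function runSerialized(sessionKey, task) {
  const previous = tails.get(sessionKey) ?? Promise.resolve();

  // Start the task only after the previously queued task for this session has settled.
  const result = previous.then(() => task());

  // The stored tail never rejects, so one failed task does not block later ones.
  const tail = result.catch(() => {});
  tails.set(sessionKey, tail);

  // Drop the map entry once the queue for this session is idle.
  tail.then(() => {
    if (tails.get(sessionKey) === tail) tails.delete(sessionKey);
  });

  return result;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function demo() {
  const events = [];
  const lane = (name, ms) => async () => {
    events.push(`${name}:start`);
    await sleep(ms);
    events.push(`${name}:end`);
  };

  // Both lanes target the same session, so they run one after the other.
  await Promise.all([
    runSerialized("agent:main:feishu:direct:USER", lane("main", 20)),
    runSerialized("agent:main:feishu:direct:USER", lane("session", 5)),
  ]);
  return events;
}

demo().then((events) => console.log(events));
// [ 'main:start', 'main:end', 'session:start', 'session:end' ]
```

Tasks for different session keys still run concurrently. The trade-off is added latency when two lanes target the same session, and a real fix would also need to cover writers in other processes, which an in-process queue cannot see.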

Labels: bug, concurrency, session, feishu

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
