EmbeddedAttemptSessionTakeoverError: auto-compaction at reason=threshold trips fence on rewritten session jsonl #90729

@johnib

Summary

EmbeddedAttemptSessionTakeoverError fires deterministically when auto-compaction at reason=threshold runs mid-turn on an active session that has paired lanes (main + session:<scope>). Both lanes die within about 20 ms of compaction completing (2 ms apart in the trace below), no user-visible reply is delivered, and the user sees only the generic "Something went wrong while processing your request" fallback.

This is the same error class as #86508 / #86966 / #86845 / #88369 / #89259 / #86572, but I'm filing separately because the trigger (auto-compaction at reason=threshold rewriting the session file mid-turn) is different from any open ticket I could find.

The point of a separate ticket is to give the compaction-specific variant its own deterministic repro so it doesn't get lost under the umbrella tickets.

Reproduction (deterministic on this deployment)

Single-user WhatsApp deployment, self-hosted Docker, 2026.5.20. The session in question (3a44d717-…) was 391 entries / 1.5 MB jsonl at the moment of failure. Auto-compaction fired at the threshold; 61 seconds later both lanes threw.

  1. Drive a WhatsApp DM session past agents.defaults.compaction.softThresholdTokens so the next inbound message will trigger reason=threshold. (agents.defaults.compaction.memoryFlush.enabled = false in our config — this is not a memoryFlush trigger.)
  2. Send a tool-heavy inbound message (in our trace: a request that drove the browser tool for several web searches).
  3. Auto-compaction starts ~24s into the turn.
  4. Compaction runs ~61s (calls the same provider as the turn).
  5. Within about 20 ms of compaction completing, both lane=main and lane=session:agent:main:whatsapp:direct:<peer> throw EmbeddedAttemptSessionTakeoverError on the same session file, 2 ms apart.

Evidence (verbatim from logs)

2026-06-05T15:04:31.585+03:00 [whatsapp] Inbound message <peer> -> <self> (direct, audio/ogg, 131 chars)
2026-06-05T15:04:46.936+03:00 [whatsapp] Sent message ... (282ms)            ← prior turn reply OK
2026-06-05T15:05:09.717+03:00 [agent/embedded] embedded run auto-compaction start: runId=1ca5018a-… reason=threshold
2026-06-05T15:06:11.025+03:00 [agent/embedded] embedded run auto-compaction complete: runId=1ca5018a-… reason=threshold compactionCount=1 willRetry=false
2026-06-05T15:06:11.041+03:00 [diagnostic] lane task error: lane=main durationMs=98855 error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: /home/node/.openclaw/agents/main/sessions/3a44d717-bec8-4fe6-9128-3a8771a6fab1.jsonl"
2026-06-05T15:06:11.043+03:00 [diagnostic] lane task error: lane=session:agent:main:whatsapp:direct:<peer> durationMs=98858 error="EmbeddedAttemptSessionTakeoverError: ... /home/node/.openclaw/agents/main/sessions/3a44d717-bec8-4fe6-9128-3a8771a6fab1.jsonl"
2026-06-05T15:06:11.059+03:00 Embedded agent failed before reply: ...

Note the timing precision: compaction complete at .025, lane=main throws at .041 (16 ms later), lane=session:… throws at .043 (2 ms after that). Both lanes had the same fence snapshot from before compaction started; compaction swapped/rewrote the file; both lanes' next withSessionWriteLock calls trip assertSessionFileFence.
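
The gaps quoted above can be re-derived directly from the log timestamps. A minimal sketch, using only the four timestamps from the excerpt; the event labels and the `deltasMs` helper are illustrative names, not part of OpenClaw:

```javascript
// Timestamps copied verbatim from the log excerpt; labels are illustrative.
const events = [
  ["compaction start", "2026-06-05T15:05:09.717+03:00"],
  ["compaction complete", "2026-06-05T15:06:11.025+03:00"],
  ["lane=main error", "2026-06-05T15:06:11.041+03:00"],
  ["lane=session error", "2026-06-05T15:06:11.043+03:00"],
];

// Returns the elapsed milliseconds between each consecutive pair of events.
function deltasMs(evts) {
  return evts.slice(1).map(([label, ts], i) => ({
    from: evts[i][0],
    to: label,
    ms: Date.parse(ts) - Date.parse(evts[i][1]),
  }));
}

for (const gap of deltasMs(events)) {
  console.log(`${gap.from} -> ${gap.to}: ${gap.ms} ms`);
}
```

The first gap is the compaction duration; the last two are the windows between compaction completing and each lane tripping the fence.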

Root cause (read from the installed dist/)

Files referenced as they appear in 2026.5.20:

  1. pi-embedded-CJ87lW5R.js:2682+ — context overflow detected mid-turn → compactContextEngineWithSafetyTimeout → on success, adoptCompactionTranscript(compactResult).
  2. pi-embedded-CJ87lW5R.js:2214-2218:
    const adoptCompactionTranscript = (compactResult) => {
        const nextSessionId = compactResult.result?.sessionId;
        const nextSessionFile = compactResult.result?.sessionFile;
        if (nextSessionId && nextSessionId !== activeSessionId) activeSessionId = nextSessionId;
        if (nextSessionFile && nextSessionFile !== activeSessionFile) activeSessionFile = nextSessionFile;
    };
    This swaps activeSessionFile in-process but does not refresh the lock-guard's fence fingerprint — there is no call into selection-BmjEdnnA.js:refreshAfterOwnedSessionWrite() or any equivalent.
  3. selection-BmjEdnnA.js:7815-7825, changeLooksLikeOwnedPromptOutput:
    if (!params.previous?.exists || !params.current.exists
        || !sameSessionFileIdentity(params.previous, params.current)
        || params.current.size < params.previous.size) return false;
    Compaction can (a) shrink the file (rewrite-in-place) or (b) change ino/dev (rewrite-then-rename or write to a different path). Both cases fail this check immediately → fence throws → EmbeddedAttemptSessionTakeoverError.
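
To make the two failing cases concrete, here is a standalone sketch of the quoted guard evaluated against three fingerprints. It is an illustration, not the shipped code: the fingerprint fields (exists, dev, ino, size), the body of sameSessionFileIdentity, the trailing `return true`, and all sizes and inode numbers are assumptions made for the example.

```javascript
// Illustrative re-creation of the quoted guard. The fingerprint fields
// (exists, dev, ino, size) are assumptions about what the fence records.
function sameSessionFileIdentity(a, b) {
  return a.dev === b.dev && a.ino === b.ino;
}

function changeLooksLikeOwnedPromptOutput(params) {
  if (!params.previous?.exists || !params.current.exists
      || !sameSessionFileIdentity(params.previous, params.current)
      || params.current.size < params.previous.size) return false;
  // The shipped function may apply further checks; this sketch stops at the quoted guard.
  return true;
}

// Fence snapshot taken before compaction (made-up numbers).
const previous = { exists: true, dev: 1, ino: 100, size: 1_500_000 };

const scenarios = {
  append: { exists: true, dev: 1, ino: 100, size: 1_500_400 },
  rewriteInPlace: { exists: true, dev: 1, ino: 100, size: 200_000 },
  rewriteThenRename: { exists: true, dev: 1, ino: 101, size: 200_000 },
};

const verdicts = Object.fromEntries(
  Object.entries(scenarios).map(([name, current]) =>
    [name, changeLooksLikeOwnedPromptOutput({ previous, current })]),
);
console.log(verdicts);
```

Only the append-style change passes the guard; both compaction shapes are rejected before any later check could recognise them as owned.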

So the bug is: the lock-guard fence doesn't know that the run itself just rewrote the session file via compaction, and treats its own write as a foreign takeover.

Why this matters

  • Silent user-facing failure. No retry, no informative message, just the generic "Something went wrong" reply. Overlaps with #89734 ("EmbeddedAttemptSessionTakeoverError can silently terminate a run without user-visible reply").
  • Reproducible on every long-session WhatsApp DM once the threshold is crossed. Disabling memoryFlush (the other path into the same race, fixed for us by config) does not mitigate this because auto-compaction at reason=threshold has no off-switch.
  • The user loses a turn's worth of work (61s of compaction + the model output that the parent run was about to produce).

Suggested fixes (in order of cost)

  1. Refresh the fence after adoptCompactionTranscript. Whoever swaps activeSessionFile should also call lockGuard.refreshAfterOwnedSessionWrite() (or expose a refreshForOwnedSessionReplace(path) variant that re-snapshots the fence against the new path). Probably the smallest patch.
  2. Extend changeLooksLikeOwnedPromptOutput with an "owned replacement" case: if the run holds an owned-compaction marker, accept any post-compaction fingerprint (including new ino, smaller size).
  3. Hold the write lock across compaction. Removes the lock-free window entirely, at the cost of blocking concurrent paired-lane reads while the LLM summarization runs (61s in our case — likely unacceptable).

I think (1) is the right shape — it's the symmetric counterpart of the refreshAfterOwnedSessionWrite() that already exists for append-style owned writes.
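
A rough sketch of what (1) could look like, under heavy assumptions: the guard below is a simplified stand-in, not the real lock guard. createLockGuard, the injected statFn, SessionTakeoverError, and the fingerprint fields are all hypothetical; only the name refreshForOwnedSessionReplace comes from the suggestion above.

```javascript
class SessionTakeoverError extends Error {}

// Hypothetical lock guard. statFn(path) returns { exists, dev, ino, size };
// it is injected so the sketch runs without touching disk.
function createLockGuard(statFn, initialPath) {
  let path = initialPath;
  let fence = statFn(path);
  return {
    // Throws when the file no longer matches the fence (simplified check).
    assertFence() {
      const cur = statFn(path);
      const owned = fence.exists && cur.exists
        && cur.dev === fence.dev && cur.ino === fence.ino
        && cur.size >= fence.size;
      if (!owned) throw new SessionTakeoverError(`session file changed: ${path}`);
    },
    // Proposed addition: re-snapshot the fence after the run's own replace.
    refreshForOwnedSessionReplace(nextPath) {
      if (nextPath) path = nextPath;
      fence = statFn(path);
    },
  };
}

const PATH = "session.jsonl";

function runTurn(refreshAfterCompaction) {
  // In-memory stand-in for the filesystem.
  const files = new Map([[PATH, { exists: true, dev: 1, ino: 100, size: 1500 }]]);
  const guard = createLockGuard((p) => files.get(p) ?? { exists: false }, PATH);
  // Compaction replaces the transcript: new inode, smaller size.
  files.set(PATH, { exists: true, dev: 1, ino: 101, size: 200 });
  if (refreshAfterCompaction) guard.refreshForOwnedSessionReplace(PATH);
  try {
    guard.assertFence();
    return "ok";
  } catch (err) {
    if (err instanceof SessionTakeoverError) return "takeover";
    throw err;
  }
}

console.log(runTurn(false), runTurn(true));
```

The refresh has to happen in the same place that adopts the compaction result, so that no paired lane can observe the new file against the old fence.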

Environment

  • OpenClaw 2026.5.20 (self-hosted Docker)
  • Node v24.14.0
  • Host: Docker Desktop on macOS, virtiofs mounts
  • Provider: a custom local-host provider (Anthropic-compatible endpoint reachable from the container), Anthropic-family model, Anthropic-style fallback chain
  • Single operator, WhatsApp DM channel
  • agents.defaults.compaction.memoryFlush.enabled = false
  • agents.defaults.compaction.softThresholdTokens = 160000
  • session.reset = {mode: "idle", idleMinutes: 10080} (so sessions accumulate for up to a week)

Happy to grab additional logs / fs_usage / a lsof snapshot of the session file at the moment of compaction if it would help narrow the rewrite-vs-replace distinction further.
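
For the rewrite-vs-replace question, a before/after stat comparison is probably enough. The sketch below exercises both shapes against a temp file using only Node's fs module; fingerprint and classify are illustrative helpers, and inode behaviour on virtiofs mounts may differ from a local filesystem, so it would need to be pointed at the real session file to be conclusive.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Captures the identity fields a fence would plausibly compare.
function fingerprint(file) {
  const st = fs.statSync(file);
  return { dev: st.dev, ino: st.ino, size: st.size };
}

// Classifies a change between two fingerprints of the same path.
function classify(before, after) {
  if (before.dev !== after.dev || before.ino !== after.ino) return "replaced";
  if (after.size < before.size) return "shrunk-in-place";
  return "appended-or-unchanged";
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "fence-"));
const file = path.join(dir, "session.jsonl");
fs.writeFileSync(file, "x".repeat(1000));
const before = fingerprint(file);

// (a) rewrite-in-place: truncate and write a shorter transcript.
fs.writeFileSync(file, "x".repeat(100));
const inPlace = classify(before, fingerprint(file));

// (b) rewrite-then-rename: write a sibling file, then rename it over the original.
const tmp = path.join(dir, "session.jsonl.tmp");
fs.writeFileSync(tmp, "x".repeat(100));
fs.renameSync(tmp, file);
const renamed = classify(before, fingerprint(file));

console.log({ inPlace, renamed });
fs.rmSync(dir, { recursive: true, force: true });
```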
