Skip to content

EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510) #85913

@ubehera

Description

@ubehera

Problem

EmbeddedAttemptSessionTakeoverError is fired when two embedded runs concurrently access the same session file — typically an agent's heartbeat lane racing a channel or direct lane on the same sessions/<uuid>.jsonl. The session-lock controller releases the write-lock around every provider stream call (releaseForPrompt() in attempt.session-lock.ts); during that release window, the other lane's writes mutate the file, the fence fingerprint (dev/ino/size/mtimeNs/ctimeNs) changes, and the original lane throws on reacquire.

In failover-error.ts, this is correctly classified via isNonProviderRuntimeCoordinationError, so the model-fallback chain aborts on it by design (no retries burned on the same provider). In practice this means the racing duplicate run on the same session file usually finishes and the user still gets a reply — but ~6% of affected turns propagate as user-visible Embedded agent failed before reply: All models failed, so the reply silently drops.

A pre-existing comment in attempt.session-lock.ts references internal issue #83510, so this is known but unfixed.

Reproduction

Any agent whose heartbeat is not marked isolatedSession: true and whose lanes share a session UUID:

  1. Configure an agent whose agents.list[].heartbeat has isolatedSession: false (or omits it; default behavior depends on the agent template).
  2. While the heartbeat embedded run is mid-stream (lock released for the provider call), an inbound chat event arrives on a lane that resolves to the same session file.
  3. The chat-handler lane writes user-context entries; the heartbeat lane reacquires, fence mismatches, throws EmbeddedAttemptSessionTakeoverError.

Across a 3-day window in one observed deployment, 122 occurrences were logged across 37 distinct session files. Lane histogram:

Lane class Count
synthetic main mirror 60
session:agent:<id>:main:heartbeat 36
session:agent:<id>:<channel-type>:channel:<id-A> 20
session:agent:<id>:<channel-type>:direct:<user> 3
session:agent:<id>:cron:… 1
session:agent:<other-id>:<channel-type>:channel:… 1
cron-nested 1

The worst single session accumulated 32 hits on one channel-bound UUID; another got 8 hits at the same channel. Heartbeat overlapping channel is the dominant pattern, not user-side double-send.

A specific reproducible trace from one session UUID: the gateway's stuck-session-recovery aborted an active embedded run and immediately re-fired a new run on the same session file, then both racing copies took turns invalidating each other's fence. durationMs values up to 1,303,727 ms (~22 min) confirm long-running embedded runs are exactly the ones that get stomped.

Code reference

  • src/agents/pi-embedded-runner/run/attempt.session-lock.tsEmbeddedAttemptSessionTakeoverError class; fence comparison in assertSessionFileFence; releaseForPrompt / reacquire path that builds the fingerprint.
  • src/agents/failover-error.tsisNonProviderRuntimeCoordinationError classifier with comment referencing internal #83510.
  • src/agents/pi-embedded-runner/google-prompt-cache.ts — imports + catches EmbeddedAttemptSessionTakeoverError; second observation point.

The takeover detection itself works as designed; the bug is upstream — two lanes that should not be sharing the same session file are. The stuck-recovery path is the most reproducible contender: it kicks off a new embedded run on a session UUID that still has an active embedded prompt lock from the original run.

Proposed change

Two-part fix:

Part A — attempt.session-lock.ts: refuse to start a new embedded run on a session that already has one active.

Maintain a per-session-file registry of active embedded-prompt holders (in-process Map keyed by absolute path). When acquireForPrompt is invoked, check the registry; if a prior holder hasn't released, either wait on a promise the prior holder resolves on release, or escalate with a typed EmbeddedAttemptSessionContendedError so the caller can pick a fresh UUID.

const ACTIVE_EMBEDDED_PROMPTS = new Map<string, Promise<void>>();

async function acquireForPrompt(sessionFile: string, ...): Promise<SessionLock> {
  const existing = ACTIVE_EMBEDDED_PROMPTS.get(sessionFile);
  if (existing) {
    if (onContention === "wait") {
      await existing;
    } else {
      throw new EmbeddedAttemptSessionContendedError(sessionFile);
    }
  }
  let releaseSignal: () => void;
  ACTIVE_EMBEDDED_PROMPTS.set(sessionFile, new Promise(r => (releaseSignal = r)));
  // ... existing logic ...
  // on final release / completion / takeover, releaseSignal() and ACTIVE_EMBEDDED_PROMPTS.delete(sessionFile)
}

Part B — stuck-session-recovery: never re-fire on a session with an active embedded-prompt holder.

The abort_embedded_run recovery path should consult the same ACTIVE_EMBEDDED_PROMPTS registry before retriggering. If a holder exists, recovery should wait for natural release (timeout-bounded) or escalate to a fresh session UUID rather than racing.

Optional Part C — for agents without heartbeat.isolatedSession: true, the heartbeat lane competes with channel/direct lanes on the same UUID. Consider either making isolatedSession: true the default for new agent templates, or surfacing a warning at config-validation time so operators are aware of the contention surface.

Why

The race is reproducible in any multi-lane agent without isolated heartbeats. Six percent of contended turns silently drop the user-facing reply — significant on any chat surface where missed replies look like the bot ignoring the user. The classification path in failover-error.ts already treats takeover as a coordination error rather than a provider failure; the fix is to prevent the race from happening rather than handle it gracefully after the fact.

A coordination-only change (no behavior change for callers that don't contend) keeps the takeover-detection mechanism intact as a safety net while removing the actual cause.

Metadata

Metadata

Assignees

Labels

P1High-priority user-facing bug, regression, or broken workflow.clawsweeper:fix-shape-clearClawSweeper found a clear likely implementation shape for this issue.clawsweeper:queueable-fixClawSweeper marked this issue as an existing queue_fix_pr work candidate.clawsweeper:source-reproClawSweeper found a high-confidence source-level issue reproduction.impact:message-lossChannel message delivery can be lost, duplicated, or misrouted.impact:session-stateSession, memory, transcript, context, or agent state can drift or corrupt.issue-rating: 🦞 diamond lobsterVery strong issue quality with high-confidence source-level or clear reproduction.

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions