Problem
EmbeddedAttemptSessionTakeoverError is thrown when two embedded runs concurrently access the same session file, typically an agent's heartbeat lane racing a channel or direct lane on the same sessions/<uuid>.jsonl. The session-lock controller releases the write lock around every provider stream call (releaseForPrompt() in attempt.session-lock.ts). During that release window, the other lane's writes mutate the file, the fence fingerprint (dev/ino/size/mtimeNs/ctimeNs) changes, and the original lane throws on reacquire.
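To make the fence mechanism concrete, here is a minimal sketch of capturing and comparing such a fingerprint with Node's bigint stat fields. The names FenceFingerprint, captureFence and fenceDiff are hypothetical and are not taken from attempt.session-lock.ts; only the field list comes from the description above.

```typescript
import * as fs from "node:fs";

// Hypothetical shape; the fields mirror the fingerprint described above.
interface FenceFingerprint {
  dev: bigint;
  ino: bigint;
  size: bigint;
  mtimeNs: bigint;
  ctimeNs: bigint;
}

// Capture the fence for a session file.
// { bigint: true } is required because mtimeNs/ctimeNs only exist on BigIntStats.
function captureFence(sessionFile: string): FenceFingerprint {
  const { dev, ino, size, mtimeNs, ctimeNs } = fs.statSync(sessionFile, { bigint: true });
  return { dev, ino, size, mtimeNs, ctimeNs };
}

// Return the names of the fields that differ.
// An empty array means the fence still holds.
function fenceDiff(before: FenceFingerprint, after: FenceFingerprint): (keyof FenceFingerprint)[] {
  const keys: (keyof FenceFingerprint)[] = ["dev", "ino", "size", "mtimeNs", "ctimeNs"];
  return keys.filter((key) => before[key] !== after[key]);
}
```

Any append by another lane between capture and re-check changes at least size, so the reacquiring lane sees a non-empty diff even when the timestamps happen to collide.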
In failover-error.ts, this is correctly classified via isNonProviderRuntimeCoordinationError, so the model-fallback chain aborts on it by design (no retries burned on the same provider). In practice, the racing duplicate run on the same session file usually finishes and the user still gets a reply, but ~6% of affected turns propagate as the user-visible error "Embedded agent failed before reply: All models failed", and the intended reply is silently dropped.
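The abort-by-design behavior can be illustrated with a small sketch. Everything here is a hypothetical stand-in: the real classifier in failover-error.ts likely inspects more than the error name, and runWithFallback is an invented helper, not the project's fallback chain.

```typescript
// Hypothetical stand-in for the real error class.
class EmbeddedAttemptSessionTakeoverError extends Error {
  constructor(sessionFile: string) {
    super(`session file taken over: ${sessionFile}`);
    this.name = "EmbeddedAttemptSessionTakeoverError";
  }
}

// Hypothetical stand-in for the classifier; assumed to match on the error name.
function isNonProviderRuntimeCoordinationError(err: unknown): boolean {
  return err instanceof Error && err.name === "EmbeddedAttemptSessionTakeoverError";
}

// Try each model in order, but stop immediately on a coordination error:
// switching providers cannot fix a contended session file.
async function runWithFallback<T>(
  models: string[],
  attempt: (model: string) => Promise<T>,
): Promise<T> {
  let providerFailures = 0;
  for (const model of models) {
    try {
      return await attempt(model);
    } catch (err) {
      if (isNonProviderRuntimeCoordinationError(err)) throw err;
      providerFailures += 1;
    }
  }
  throw new Error(`All models failed (${providerFailures} provider errors)`);
}
```

In this sketch a takeover on the first model ends the chain after a single attempt, whereas an ordinary provider error moves on to the next model.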
A pre-existing comment in attempt.session-lock.ts references internal issue #83510, so this is known but unfixed.
Reproduction
Any agent whose heartbeat is not marked isolatedSession: true and whose lanes share a session UUID:
- Configure an agent whose agents.list[].heartbeat has isolatedSession: false (or omits it; default behavior depends on the agent template).
- While the heartbeat embedded run is mid-stream (lock released for the provider call), an inbound chat event arrives on a lane that resolves to the same session file.
- The chat-handler lane writes user-context entries; the heartbeat lane reacquires, fence mismatches, throws EmbeddedAttemptSessionTakeoverError.
Across a 3-day window in one observed deployment, 122 occurrences were logged across 37 distinct session files. Lane histogram:
| Lane class | Count |
| --- | ---: |
| synthetic main mirror | 60 |
| session:agent:<id>:main:heartbeat | 36 |
| session:agent:<id>:<channel-type>:channel:<id-A> | 20 |
| session:agent:<id>:<channel-type>:direct:<user> | 3 |
| session:agent:<id>:cron:… | 1 |
| session:agent:<other-id>:<channel-type>:channel:… | 1 |
| cron-nested | 1 |
The worst single session accumulated 32 hits on one channel-bound UUID; another got 8 hits at the same channel. Heartbeat overlapping channel is the dominant pattern, not user-side double-send.
A specific reproducible trace from one session UUID: the gateway's stuck-session-recovery aborted an active embedded run and immediately re-fired a new run on the same session file, then both racing copies took turns invalidating each other's fence. durationMs values up to 1,303,727 ms (~22 min) confirm long-running embedded runs are exactly the ones that get stomped.
Code reference
src/agents/pi-embedded-runner/run/attempt.session-lock.ts — EmbeddedAttemptSessionTakeoverError class; fence comparison in assertSessionFileFence; releaseForPrompt / reacquire path that builds the fingerprint.
src/agents/failover-error.ts — isNonProviderRuntimeCoordinationError classifier with comment referencing internal #83510.
src/agents/pi-embedded-runner/google-prompt-cache.ts — imports + catches EmbeddedAttemptSessionTakeoverError; second observation point.
The takeover detection itself works as designed; the bug is upstream — two lanes that should not be sharing the same session file are. The stuck-recovery path is the most reproducible contender: it kicks off a new embedded run on a session UUID that still has an active embedded prompt lock from the original run.
Proposed change
Two-part fix:
Part A — attempt.session-lock.ts: refuse to start a new embedded run on a session that already has one active.
Maintain a per-session-file registry of active embedded-prompt holders (in-process Map keyed by absolute path). When acquireForPrompt is invoked, check the registry; if a prior holder hasn't released, either wait on a promise the prior holder resolves on release, or escalate with a typed EmbeddedAttemptSessionContendedError so the caller can pick a fresh UUID.
    const ACTIVE_EMBEDDED_PROMPTS = new Map<string, Promise<void>>();

    async function acquireForPrompt(sessionFile: string, ...): Promise<SessionLock> {
      // onContention is assumed to arrive via the elided parameters.
      // Loop rather than check once: after a wait, another waiter may already
      // have registered itself as the new holder.
      let existing: Promise<void> | undefined;
      while ((existing = ACTIVE_EMBEDDED_PROMPTS.get(sessionFile))) {
        if (onContention === "wait") {
          await existing;
        } else {
          throw new EmbeddedAttemptSessionContendedError(sessionFile);
        }
      }
      // Definite-assignment assertion: the Promise executor runs synchronously.
      let releaseSignal!: () => void;
      ACTIVE_EMBEDDED_PROMPTS.set(sessionFile, new Promise<void>(r => (releaseSignal = r)));
      // ... existing logic ...
      // on final release / completion / takeover, releaseSignal() and ACTIVE_EMBEDDED_PROMPTS.delete(sessionFile)
    }
Part B — stuck-session-recovery: never re-fire on a session with an active embedded-prompt holder.
The abort_embedded_run recovery path should consult the same ACTIVE_EMBEDDED_PROMPTS registry before retriggering. If a holder exists, recovery should wait for natural release (timeout-bounded) or escalate to a fresh session UUID rather than racing.
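A minimal sketch of the timeout-bounded check follows, assuming the registry has the Map shape from Part A. The function decideRecovery, the RecoveryDecision type and the timeoutMs parameter are hypothetical names for illustration, not existing recovery APIs.

```typescript
// Assumed registry shape, matching the Map sketched in Part A.
const ACTIVE_EMBEDDED_PROMPTS = new Map<string, Promise<void>>();

type RecoveryDecision = "refire-same-session" | "fresh-session";

// Decide how stuck-session recovery should proceed for one session file:
// refire only once no holder is registered, otherwise wait up to timeoutMs.
async function decideRecovery(sessionFile: string, timeoutMs: number): Promise<RecoveryDecision> {
  const holder = ACTIVE_EMBEDDED_PROMPTS.get(sessionFile);
  if (!holder) return "refire-same-session";

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  try {
    const outcome = await Promise.race([
      holder.then(() => "released" as const),
      timedOut,
    ]);
    // Natural release within the bound: safe to reuse the session file.
    // Otherwise escalate to a fresh session UUID instead of racing.
    return outcome === "released" ? "refire-same-session" : "fresh-session";
  } finally {
    clearTimeout(timer);
  }
}
```

Even on "refire-same-session", the refired run would still go through acquireForPrompt, which re-checks the registry, so a lane that registered in the meantime is not raced.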
Optional Part C — for agents without heartbeat.isolatedSession: true, the heartbeat lane competes with channel/direct lanes on the same UUID. Consider either making isolatedSession: true the default for new agent templates, or surfacing a warning at config-validation time so operators are aware of the contention surface.
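The config-validation warning could look roughly like the sketch below. The AgentConfig shape is a hypothetical reduction to the fields named in this issue, and heartbeatContentionWarnings is an invented helper, not an existing validator hook.

```typescript
// Hypothetical config shape, limited to the fields named in this issue.
interface AgentConfig {
  id: string;
  heartbeat?: { isolatedSession?: boolean };
}

// Emit one warning per agent whose heartbeat is configured but not isolated.
// Agents without a heartbeat have no heartbeat lane and are skipped.
function heartbeatContentionWarnings(agents: AgentConfig[]): string[] {
  return agents
    .filter((agent) => agent.heartbeat !== undefined && agent.heartbeat.isolatedSession !== true)
    .map(
      (agent) =>
        `agent "${agent.id}": heartbeat.isolatedSession is not true; ` +
        "the heartbeat lane may share a session file with channel/direct lanes",
    );
}
```

Treating an omitted isolatedSession the same as false is a conservative assumption here, since the effective default depends on the agent template.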
Why
The race is reproducible in any multi-lane agent without isolated heartbeats. Six percent of contended turns silently drop the user-facing reply — significant on any chat surface where missed replies look like the bot ignoring the user. The classification path in failover-error.ts already treats takeover as a coordination error rather than a provider failure; the fix is to prevent the race from happening rather than handle it gracefully after the fact.
A coordination-only change (no behavior change for callers that don't contend) keeps the takeover-detection mechanism intact as a safety net while removing the actual cause.