EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510)

### Problem

`EmbeddedAttemptSessionTakeoverError` is fired when two embedded runs concurrently access the same session file — typically an agent's heartbeat lane racing a channel or direct lane on the same `sessions/<uuid>.jsonl`. The session-lock controller releases the write-lock around every provider stream call (`releaseForPrompt()` in `attempt.session-lock.ts`); during that release window, the other lane's writes mutate the file, the fence fingerprint (`dev/ino/size/mtimeNs/ctimeNs`) changes, and the original lane throws on reacquire.

In `failover-error.ts`, this is correctly classified via `isNonProviderRuntimeCoordinationError`, so the model-fallback chain aborts on it by design (no retries burned on the same provider). In practice this means the racing duplicate run on the same session file usually finishes and the user still gets a reply — but **~6% of affected turns propagate as user-visible `Embedded agent failed before reply: All models failed`**, so the reply silently drops.

A pre-existing comment in `attempt.session-lock.ts` references internal issue `#83510`, so this is known but unfixed.

### Reproduction

Any agent whose heartbeat is **not** marked `isolatedSession: true` and whose lanes share a session UUID:

1. Configure an agent whose `agents.list[].heartbeat` has `isolatedSession: false` (or omits it; default behavior depends on the agent template).
2. While the heartbeat embedded run is mid-stream (lock released for the provider call), an inbound chat event arrives on a lane that resolves to the same session file.
3. The chat-handler lane writes user-context entries; the heartbeat lane reacquires, fence mismatches, throws `EmbeddedAttemptSessionTakeoverError`.

Across a 3-day window in one observed deployment, **122 occurrences** were logged across 37 distinct session files. Lane histogram:

| Lane class | Count |
|---|---|
| synthetic `main` mirror | 60 |
| `session:agent:<id>:main:heartbeat` | 36 |
| `session:agent:<id>:<channel-type>:channel:<id-A>` | 20 |
| `session:agent:<id>:<channel-type>:direct:<user>` | 3 |
| `session:agent:<id>:cron:…` | 1 |
| `session:agent:<other-id>:<channel-type>:channel:…` | 1 |
| `cron-nested` | 1 |

The worst single session accumulated 32 hits on one channel-bound UUID; another got 8 hits at the same channel. **Heartbeat overlapping channel is the dominant pattern**, not user-side double-send.

A specific reproducible trace from one session UUID: the gateway's stuck-session-recovery aborted an active embedded run and immediately re-fired a new run on the same session file, then both racing copies took turns invalidating each other's fence. `durationMs` values up to 1,303,727 ms (~22 min) confirm long-running embedded runs are exactly the ones that get stomped.

### Code reference

- `src/agents/pi-embedded-runner/run/attempt.session-lock.ts` — `EmbeddedAttemptSessionTakeoverError` class; fence comparison in `assertSessionFileFence`; `releaseForPrompt` / reacquire path that builds the fingerprint.
- `src/agents/failover-error.ts` — `isNonProviderRuntimeCoordinationError` classifier with comment referencing internal `#83510`.
- `src/agents/pi-embedded-runner/google-prompt-cache.ts` — imports + catches `EmbeddedAttemptSessionTakeoverError`; second observation point.

The takeover detection itself works as designed; the bug is upstream — two lanes that **should not** be sharing the same session file are. The stuck-recovery path is the most reproducible contender: it kicks off a new embedded run on a session UUID that still has an active embedded prompt lock from the original run.

### Proposed change

Two-part fix:

**Part A — `attempt.session-lock.ts`: refuse to start a new embedded run on a session that already has one active.**

Maintain a per-session-file registry of active embedded-prompt holders (in-process Map keyed by absolute path). When `acquireForPrompt` is invoked, check the registry; if a prior holder hasn't released, either wait on a promise the prior holder resolves on release, or escalate with a typed `EmbeddedAttemptSessionContendedError` so the caller can pick a fresh UUID.

```ts
const ACTIVE_EMBEDDED_PROMPTS = new Map<string, Promise<void>>();

async function acquireForPrompt(sessionFile: string, ...): Promise<SessionLock> {
  const existing = ACTIVE_EMBEDDED_PROMPTS.get(sessionFile);
  if (existing) {
    if (onContention === "wait") {
      await existing;
    } else {
      throw new EmbeddedAttemptSessionContendedError(sessionFile);
    }
  }
  let releaseSignal: () => void;
  ACTIVE_EMBEDDED_PROMPTS.set(sessionFile, new Promise(r => (releaseSignal = r)));
  // ... existing logic ...
  // on final release / completion / takeover, releaseSignal() and ACTIVE_EMBEDDED_PROMPTS.delete(sessionFile)
}
```

**Part B — stuck-session-recovery: never re-fire on a session with an active embedded-prompt holder.**

The `abort_embedded_run` recovery path should consult the same `ACTIVE_EMBEDDED_PROMPTS` registry before retriggering. If a holder exists, recovery should wait for natural release (timeout-bounded) or escalate to a fresh session UUID rather than racing.

Optional Part C — for agents without `heartbeat.isolatedSession: true`, the heartbeat lane competes with channel/direct lanes on the same UUID. Consider either making `isolatedSession: true` the default for new agent templates, or surfacing a warning at config-validation time so operators are aware of the contention surface.

### Why

The race is reproducible in any multi-lane agent without isolated heartbeats. Six percent of contended turns silently drop the user-facing reply — significant on any chat surface where missed replies look like the bot ignoring the user. The classification path in `failover-error.ts` already treats takeover as a coordination error rather than a provider failure; the fix is to prevent the race from happening rather than handle it gracefully after the fact.

A coordination-only change (no behavior change for callers that don't contend) keeps the takeover-detection mechanism intact as a safety net while removing the actual cause.


Lane class	Count
synthetic `main` mirror	60
`session:agent:<id>:main:heartbeat`	36
`session:agent:<id>:<channel-type>:channel:<id-A>`	20
`session:agent:<id>:<channel-type>:direct:<user>`	3
`session:agent:<id>:cron:…`	1
`session:agent:<other-id>:<channel-type>:channel:…`	1
`cron-nested`	1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510) #85913

Problem

Reproduction

Code reference

Proposed change

Why

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

EmbeddedAttemptSessionTakeoverError races between heartbeat lane and channel/direct lane on same session file (internal ref #83510) #85913

Description

Problem

Reproduction

Code reference

Proposed change

Why

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions