fix: session write lock race — async release can delete newly-acquired lock #57019

@jamie8johnson

Description

Session write lock race: release can destroy a freshly acquired lock

Summary

In src/agents/session-write-lock.ts, releaseHeldLock deletes the session key from the in-memory HELD_LOCKS Map (line 149) before starting the async file cleanup (handle.close() + fs.rm(), lines 150-161). This opens a race window where a concurrent acquireSessionWriteLock for the same session can acquire a new lock, only for the old release's deferred fs.rm to delete the new lock file out from under it.
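
For orientation, the ordering described above is roughly the following (a simplified sketch reconstructed from this report; the HeldLock shape, imports, and surrounding structure are assumptions, not the actual source):

import * as fs from "node:fs/promises";

// Assumed shape of a HELD_LOCKS entry, inferred from this report:
interface HeldLock {
  handle: fs.FileHandle;
  lockPath: string;
  releasePromise?: Promise<void>;
}
const HELD_LOCKS = new Map<string, HeldLock>();

function releaseHeldLock(normalizedSessionFile: string): void {
  const held = HELD_LOCKS.get(normalizedSessionFile);
  if (!held) return;
  HELD_LOCKS.delete(normalizedSessionFile);   // line 149: the entry disappears immediately
  held.releasePromise = (async () => {        // lines 150-161: file cleanup is deferred
    try { await held.handle.close(); } catch {}
    try { await fs.rm(held.lockPath, { force: true }); } catch {}  // may run after a re-acquire
  })();
}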

Affected platforms

  • macOS and Windows: always exploitable. getProcessStartTime() returns null on non-Linux platforms, so lock files never contain starttime. This means shouldTreatAsOrphanSelfLock (line 389) always reaches its !HELD_LOCKS.has() check for same-PID contention.
  • Linux: exploitable only when /proc/<pid>/stat is unreadable (containers with restricted /proc, or unusual security policies), since a valid starttime in the payload causes shouldTreatAsOrphanSelfLock to return false early (lines 398-399).
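
Put together, the check these bullets describe amounts to something like the following (a reconstruction from the behavior reported here; the real parameter shape and control flow may differ):

// Reconstructed decision logic; parameter names are assumptions.
function shouldTreatAsOrphanSelfLock(params: {
  payloadPid: number;
  payloadStarttime: string | null;   // always null on macOS/Windows: getProcessStartTime() has no /proc to read
  normalizedSessionFile: string;
}): boolean {
  if (params.payloadPid !== process.pid) return false;   // only applies to same-PID contention
  if (params.payloadStarttime !== null) return false;    // lines 398-399: a valid starttime returns false early
  // With no starttime to compare, the only remaining signal is the in-memory map,
  // which releaseHeldLock has already cleared by the time its file cleanup starts.
  return !HELD_LOCKS.has(params.normalizedSessionFile);
}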

Step-by-step race scenario

Precondition: Process P holds a write lock on session S. Lock file exists at S.jsonl.lock with pid = P.pid and no starttime (macOS/Windows).

  1. P calls release() for session S — enters releaseHeldLock.
  2. Line 149: HELD_LOCKS.delete(normalizedSessionFile) executes synchronously. The Map no longer has session S.
  3. Line 150: The async IIFE starts — handle.close() begins but has not resolved.
  4. P calls acquireSessionWriteLock for session S (e.g., a reentrant agent writing to the same transcript), either from a microtask or from a concurrent call in the same event loop tick.
  5. Line 503: fs.open(lockPath, "wx") fails with EEXIST — the old lock file still exists on disk.
  6. Line 541: readLockPayload reads the old lock file. Payload has pid = process.pid, no starttime.
  7. Line 544: shouldTreatAsOrphanSelfLock is called:
    • pid === process.pid — true (same process)
    • hasValidStarttime — false (macOS, no /proc)
    • !HELD_LOCKS.has(normalizedSessionFile) — true (deleted at step 2)
    • Returns true
  8. Lines 548-556: reclaimDetails is constructed with stale: true forced and "orphan-self-pid" added to reasons.
  9. Line 557: shouldReclaimContendedLockFile returns true (stale is true, and the reasons include "orphan-self-pid", which doesn't match the mtime-fallback filter).
  10. Line 558: fs.rm(lockPath, { force: true }) — deletes the old lock file. continue re-enters the loop.
  11. Line 503: fs.open(lockPath, "wx") succeeds — new lock file created, written, stored in HELD_LOCKS.
  12. Meanwhile: The old release's async IIFE (from step 3) resumes. handle.close() completes, then line 157: fs.rm(held.lockPath, { force: true }) executes — deleting the new lock file.

Result: Process P believes it holds the session write lock (entry exists in HELD_LOCKS with a valid handle), but the lock file no longer exists on disk. Any other process can now open("wx") the same path and obtain a concurrent "lock" on the same session.
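
Expressed as a hypothetical repro (assuming acquireSessionWriteLock is exported from src/agents/session-write-lock.ts and returns a handle with a release() method, as the scenario above implies; the import path and API shape are assumptions):

import * as fs from "node:fs/promises";
import { acquireSessionWriteLock } from "./session-write-lock";

async function repro(sessionFile: string): Promise<void> {
  const first = await acquireSessionWriteLock(sessionFile);
  const pendingCleanup = first.release();                     // steps 1-3: HELD_LOCKS entry removed, fs.rm deferred
  const second = await acquireSessionWriteLock(sessionFile);  // steps 4-11: old file reclaimed as "orphan-self-pid"
  await pendingCleanup;                                       // step 12: the stale fs.rm deletes second's lock file
  await fs.access(`${sessionFile}.lock`);                     // now throws ENOENT even though `second` is still "held"
}

// Note: this interleaving is timing-dependent; it relies on the deferred cleanup not
// finishing before the second acquire reads the old lock file from disk.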

Impact

  • Session transcript corruption: Two processes (or two agents in the same process) can write concurrently to the same .jsonl session file, interleaving or overwriting each other's JSON lines.
  • Silent failure: The lock holder has no indication that its lock file was deleted. The HELD_LOCKS entry still exists, and the handle is still open (pointing at a now-unlinked inode on Linux, or a deleted-but-cached file on macOS/Windows).
  • Watchdog blind spot: The watchdog timer checks HELD_LOCKS entries but doesn't verify the lock file still exists on disk.
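
For illustration, a disk-existence check of the kind this bullet is pointing at (not present in the current code; names reuse the sketches above) could look like:

async function auditHeldLocks(): Promise<void> {
  for (const [sessionFile, held] of HELD_LOCKS) {
    try {
      await fs.access(held.lockPath);   // confirm the lock file still exists on disk
    } catch {
      console.warn(`held lock for ${sessionFile} has no lock file on disk`);
    }
  }
}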

Suggested fixes

Option A — "releasing" guard set (minimal change):

Add a RELEASING_LOCKS: Set<string> alongside HELD_LOCKS. In releaseHeldLock, add the key to RELEASING_LOCKS before deleting from HELD_LOCKS, and remove it from RELEASING_LOCKS after the async fs.rm completes. In shouldTreatAsOrphanSelfLock, return false if RELEASING_LOCKS.has(normalizedSessionFile).

const RELEASING_LOCKS = new Set<string>();

// In releaseHeldLock, around line 149:
RELEASING_LOCKS.add(normalizedSessionFile);
HELD_LOCKS.delete(normalizedSessionFile);
held.releasePromise = (async () => {
  try { await held.handle.close(); } catch {}
  try { await fs.rm(held.lockPath, { force: true }); } catch {}
  finally { RELEASING_LOCKS.delete(normalizedSessionFile); }
})();

// In shouldTreatAsOrphanSelfLock:
if (RELEASING_LOCKS.has(params.normalizedSessionFile)) {
  return false;
}

Option B — move delete after async IO (simpler but changes semantics):

Move HELD_LOCKS.delete(normalizedSessionFile) to after the fs.rm completes (inside the async IIFE, after line 157). This keeps the entry visible in HELD_LOCKS during the entire cleanup, so shouldTreatAsOrphanSelfLock returns false and the acquirer waits/retries normally.
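
Concretely, Option B would look roughly like this (same caveats as the sketches above: the surrounding structure is assumed):

// In releaseHeldLock:
held.releasePromise = (async () => {
  try { await held.handle.close(); } catch {}
  try { await fs.rm(held.lockPath, { force: true }); } catch {}
  finally { HELD_LOCKS.delete(normalizedSessionFile); }   // the lock stops being "held" only once cleanup finishes
})();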

The downside is that HELD_LOCKS.size remains non-zero during cleanup, which keeps the watchdog running slightly longer, but this is harmless.

Option B is simpler and more correct — the lock should be considered "held" until cleanup actually finishes, not just until cleanup starts.
