Session write lock race: release can destroy a freshly acquired lock
Summary
In src/agents/session-write-lock.ts, releaseHeldLock deletes the session key from the in-memory HELD_LOCKS Map (line 149) before starting the async file cleanup (handle.close() + fs.rm(), lines 150-161). This opens a race window where a concurrent acquireSessionWriteLock for the same session can acquire a new lock, only for the old release's deferred fs.rm to delete the new lock file out from under it.
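For orientation, the release path looks roughly like this (a simplified sketch reconstructed from the line references above, not the verbatim source; the HeldLock shape is an assumption):

import * as fs from "node:fs/promises";
import type { FileHandle } from "node:fs/promises";

// Hypothetical shapes; the real module's internals may differ.
type HeldLock = { handle: FileHandle; lockPath: string; releasePromise?: Promise<void> };
const HELD_LOCKS = new Map<string, HeldLock>();

function releaseHeldLock(normalizedSessionFile: string): void {
  const held = HELD_LOCKS.get(normalizedSessionFile);
  if (!held) return;
  HELD_LOCKS.delete(normalizedSessionFile); // line 149: synchronous; the race window opens here
  held.releasePromise = (async () => {      // lines 150-161: deferred file cleanup
    try { await held.handle.close(); } catch {}
    // line 157: by the time this runs, the path may hold a freshly acquired lock file
    try { await fs.rm(held.lockPath, { force: true }); } catch {}
  })();
}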
Affected platforms
- macOS and Windows: always exploitable. getProcessStartTime() returns null on non-Linux platforms, so lock files never contain starttime. This means shouldTreatAsOrphanSelfLock (line 389) always reaches its !HELD_LOCKS.has() check for same-PID contention.
- Linux: exploitable only when /proc/<pid>/stat is unreadable (containers with restricted /proc, or unusual security policies), since a valid starttime in the payload causes shouldTreatAsOrphanSelfLock to return false early (lines 398-399).
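The platform split hinges on whether a starttime can be read at all. A minimal sketch of the described behavior (assumed shape, not the actual implementation):

import { readFileSync } from "node:fs";

function getProcessStartTime(pid: number): string | null {
  if (process.platform !== "linux") return null; // macOS/Windows: payload never gets a starttime
  try {
    const stat = readFileSync(`/proc/${pid}/stat`, "utf8");
    // starttime is field 22 of /proc/<pid>/stat; fields 3+ follow the parenthesized comm
    return stat.slice(stat.lastIndexOf(")") + 2).split(" ")[19] ?? null;
  } catch {
    return null; // restricted /proc: same exposure as non-Linux platforms
  }
}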
Step-by-step race scenario
Precondition: Process P holds a write lock on session S. Lock file exists at S.jsonl.lock with pid = P.pid and no starttime (macOS/Windows).
1. P calls release() for session S — enters releaseHeldLock.
2. Line 149: HELD_LOCKS.delete(normalizedSessionFile) executes synchronously. The Map no longer has session S.
3. Line 150: the async IIFE starts — handle.close() begins but has not resolved.
4. P calls acquireSessionWriteLock for session S (e.g., a reentrant agent writing to the same transcript), via a microtask or a concurrent call from the same event loop tick.
5. Line 503: fs.open(lockPath, "wx") fails with EEXIST — the old lock file still exists on disk.
6. Line 541: readLockPayload reads the old lock file. The payload has pid = process.pid and no starttime.
7. Line 544: shouldTreatAsOrphanSelfLock is called:
   - pid === process.pid — true (same process)
   - hasValidStarttime — false (macOS, no /proc)
   - !HELD_LOCKS.has(normalizedSessionFile) — true (deleted at step 2)
   - Returns true.
8. Lines 548-556: reclaimDetails is constructed with stale: true forced and "orphan-self-pid" added to reasons.
9. Line 557: shouldReclaimContendedLockFile returns true (stale is true, and "orphan-self-pid" doesn't match the mtime-fallback filter).
10. Line 558: fs.rm(lockPath, { force: true }) deletes the old lock file; continue re-enters the loop.
11. Line 503: fs.open(lockPath, "wx") now succeeds — a new lock file is created, written, and stored in HELD_LOCKS.
12. Meanwhile, the old release's async IIFE (from step 3) resumes: handle.close() completes, then line 157's fs.rm(held.lockPath, { force: true }) executes, deleting the new lock file.
Result: Process P believes it holds the session write lock (entry exists in HELD_LOCKS with a valid handle), but the lock file no longer exists on disk. Any other process can now open("wx") the same path and obtain a concurrent "lock" on the same session.
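A minimal reproduction sketch; the API names come from the report above, but the exact export and signatures (release() returning a promise, the shape of the returned lock) are assumptions:

import { acquireSessionWriteLock } from "../src/agents/session-write-lock"; // assumed export

const sessionFile = "/tmp/S.jsonl"; // hypothetical path

const first = await acquireSessionWriteLock(sessionFile);
void first.release();                                      // steps 1-3: map entry gone, fs.rm deferred
const second = await acquireSessionWriteLock(sessionFile); // steps 4-11: reclaims the "orphan" lock
// Once the deferred fs.rm from step 3 resolves, second's lock file is gone from disk,
// even though HELD_LOCKS still reports the lock as held.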
Impact
- Session transcript corruption: two processes (or two agents in the same process) can write concurrently to the same .jsonl session file, interleaving or overwriting each other's JSON lines.
- Silent failure: the lock holder has no indication that its lock file was deleted. The HELD_LOCKS entry still exists, and the handle is still open (pointing at a now-unlinked inode on Linux, or a deleted-but-cached file on macOS/Windows).
- Watchdog blind spot: the watchdog timer checks HELD_LOCKS entries but doesn't verify the lock file still exists on disk.
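Until the root cause is fixed, that blind spot could at least be surfaced. A hedged sketch, assuming the watchdog can see each entry's lockPath:

// Hypothetical watchdog addition: confirm each held lock still exists on disk.
for (const [sessionFile, held] of HELD_LOCKS) {
  fs.stat(held.lockPath).catch(() => {
    console.warn(`lock file missing for ${sessionFile}; the lock may have been reclaimed`);
  });
}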
Suggested fixes
Option A — "releasing" guard set (minimal change):
Add a RELEASING_LOCKS: Set<string> alongside HELD_LOCKS. In releaseHeldLock, add the key to RELEASING_LOCKS before deleting from HELD_LOCKS, and remove it from RELEASING_LOCKS after the async fs.rm completes. In shouldTreatAsOrphanSelfLock, return false if RELEASING_LOCKS.has(normalizedSessionFile).
const RELEASING_LOCKS = new Set<string>();

// In releaseHeldLock, around line 149:
RELEASING_LOCKS.add(normalizedSessionFile);
HELD_LOCKS.delete(normalizedSessionFile);
held.releasePromise = (async () => {
  try {
    try { await held.handle.close(); } catch {}
    try { await fs.rm(held.lockPath, { force: true }); } catch {}
  } finally {
    RELEASING_LOCKS.delete(normalizedSessionFile);
  }
})();

// In shouldTreatAsOrphanSelfLock:
if (RELEASING_LOCKS.has(params.normalizedSessionFile)) {
  return false;
}
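Two details matter here: the guard must be keyed by the same normalized path as HELD_LOCKS, and the Set entry must be removed in a finally block so a failed close() or fs.rm cannot leave the session permanently treated as releasing.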
Option B — move delete after async IO (simpler but changes semantics):
Move HELD_LOCKS.delete(normalizedSessionFile) to after the fs.rm completes (inside the async IIFE, after line 157). This keeps the entry visible in HELD_LOCKS during the entire cleanup, so shouldTreatAsOrphanSelfLock returns false and the acquirer waits/retries normally.
The downside is that HELD_LOCKS.size remains non-zero during cleanup, which keeps the watchdog running slightly longer, but this is harmless.
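A sketch of the reordering, using the same assumed shapes as above:

held.releasePromise = (async () => {
  try { await held.handle.close(); } catch {}
  try { await fs.rm(held.lockPath, { force: true }); } catch {}
  // Only once the file is actually gone does the lock stop being "held":
  HELD_LOCKS.delete(normalizedSessionFile);
})();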
Option B is simpler and more correct — the lock should be considered "held" until cleanup actually finishes, not just until cleanup starts.