fix: session write lock race — async release can delete newly-acquired lock #57019

@jamie8johnson

Description

Session write lock race: release can destroy a freshly acquired lock

Summary

In src/agents/session-write-lock.ts, releaseHeldLock deletes the session key from the in-memory HELD_LOCKS Map (line 149) before starting the async file cleanup (handle.close() + fs.rm(), lines 150-161). This opens a race window where a concurrent acquireSessionWriteLock for the same session can acquire a new lock, only for the old release's deferred fs.rm to delete the new lock file out from under it.
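
For orientation, the ordering described above is roughly the following (a simplified sketch reconstructed from this report; the HeldLock shape, imports, and surrounding structure are assumptions, not the actual source):

import * as fs from "node:fs/promises";

// Assumed shape of a HELD_LOCKS entry, inferred from this report:
interface HeldLock {
  handle: fs.FileHandle;
  lockPath: string;
  releasePromise?: Promise<void>;
}
const HELD_LOCKS = new Map<string, HeldLock>();

function releaseHeldLock(normalizedSessionFile: string): void {
  const held = HELD_LOCKS.get(normalizedSessionFile);
  if (!held) return;
  HELD_LOCKS.delete(normalizedSessionFile);   // line 149: the entry disappears immediately
  held.releasePromise = (async () => {        // lines 150-161: file cleanup is deferred
    try { await held.handle.close(); } catch {}
    try { await fs.rm(held.lockPath, { force: true }); } catch {}  // may run after a re-acquire
  })();
}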

Affected platforms

  • macOS and Windows: always exploitable. getProcessStartTime() returns null on non-Linux platforms, so lock files never contain starttime. This means shouldTreatAsOrphanSelfLock (line 389) always reaches its !HELD_LOCKS.has() check for same-PID contention.
  • Linux: exploitable only when /proc/<pid>/stat is unreadable (containers with restricted /proc, or unusual security policies), since a valid starttime in the payload causes shouldTreatAsOrphanSelfLock to return false early (lines 398-399).
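
Put together, the check these bullets describe amounts to something like the following (a reconstruction from the behavior reported here; the real parameter shape and control flow may differ):

// Reconstructed decision logic; parameter names are assumptions.
function shouldTreatAsOrphanSelfLock(params: {
  payloadPid: number;
  payloadStarttime: string | null;   // always null on macOS/Windows: getProcessStartTime() has no /proc to read
  normalizedSessionFile: string;
}): boolean {
  if (params.payloadPid !== process.pid) return false;   // only applies to same-PID contention
  if (params.payloadStarttime !== null) return false;    // lines 398-399: a valid starttime returns false early
  // With no starttime to compare, the only remaining signal is the in-memory map,
  // which releaseHeldLock has already cleared by the time its file cleanup starts.
  return !HELD_LOCKS.has(params.normalizedSessionFile);
}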

Step-by-step race scenario

Precondition: Process P holds a write lock on session S. Lock file exists at S.jsonl.lock with pid = P.pid and no starttime (macOS/Windows).

  1. P calls release() for session S — enters releaseHeldLock.
  2. Line 149: HELD_LOCKS.delete(normalizedSessionFile) executes synchronously. The Map no longer has session S.
  3. Line 150: The async IIFE starts — handle.close() begins but has not resolved.
  4. P calls acquireSessionWriteLock for session S (e.g., a reentrant agent writing to the same transcript), either from a microtask or from a concurrent call in the same event loop tick.
  5. Line 503: fs.open(lockPath, "wx") fails with EEXIST — the old lock file still exists on disk.
  6. Line 541: readLockPayload reads the old lock file. Payload has pid = process.pid, no starttime.
  7. Line 544: shouldTreatAsOrphanSelfLock is called:
    • pid === process.pid — true (same process)
    • hasValidStarttime — false (macOS, no /proc)
    • !HELD_LOCKS.has(normalizedSessionFile) — true (deleted at step 2)
    • Returns true
  8. Lines 548-556: reclaimDetails is constructed with stale: true forced and "orphan-self-pid" added to reasons.
  9. Line 557: shouldReclaimContendedLockFile returns true (stale is true, and the reasons include "orphan-self-pid", which doesn't match the mtime-fallback filter).
  10. Line 558: fs.rm(lockPath, { force: true }) — deletes the old lock file. continue re-enters the loop.
  11. Line 503: fs.open(lockPath, "wx") succeeds — new lock file created, written, stored in HELD_LOCKS.
  12. Meanwhile: The old release's async IIFE (from step 3) resumes. handle.close() completes, then line 157: fs.rm(held.lockPath, { force: true }) executes — deleting the new lock file.

Result: Process P believes it holds the session write lock (entry exists in HELD_LOCKS with a valid handle), but the lock file no longer exists on disk. Any other process can now open("wx") the same path and obtain a concurrent "lock" on the same session.
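
Expressed as a hypothetical repro (assuming acquireSessionWriteLock is exported from src/agents/session-write-lock.ts and returns a handle with a release() method, as the scenario above implies; the import path and API shape are assumptions):

import * as fs from "node:fs/promises";
import { acquireSessionWriteLock } from "./session-write-lock";

async function repro(sessionFile: string): Promise<void> {
  const first = await acquireSessionWriteLock(sessionFile);
  const pendingCleanup = first.release();                     // steps 1-3: HELD_LOCKS entry removed, fs.rm deferred
  const second = await acquireSessionWriteLock(sessionFile);  // steps 4-11: old file reclaimed as "orphan-self-pid"
  await pendingCleanup;                                       // step 12: the stale fs.rm deletes second's lock file
  await fs.access(`${sessionFile}.lock`);                     // now throws ENOENT even though `second` is still "held"
}

// Note: this interleaving is timing-dependent; it relies on the deferred cleanup not
// finishing before the second acquire reads the old lock file from disk.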

Impact

  • Session transcript corruption: Two processes (or two agents in the same process) can write concurrently to the same .jsonl session file, interleaving or overwriting each other's JSON lines.
  • Silent failure: The lock holder has no indication that its lock file was deleted. The HELD_LOCKS entry still exists, and the handle is still open (pointing at a now-unlinked inode on Linux, or a deleted-but-cached file on macOS/Windows).
  • Watchdog blind spot: The watchdog timer checks HELD_LOCKS entries but doesn't verify the lock file still exists on disk.
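
For illustration, a disk-existence check of the kind this bullet is pointing at (not present in the current code; names reuse the sketches above) could look like:

async function auditHeldLocks(): Promise<void> {
  for (const [sessionFile, held] of HELD_LOCKS) {
    try {
      await fs.access(held.lockPath);   // confirm the lock file still exists on disk
    } catch {
      console.warn(`held lock for ${sessionFile} has no lock file on disk`);
    }
  }
}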

Suggested fixes

Option A — "releasing" guard set (minimal change):

Add a RELEASING_LOCKS: Set<string> alongside HELD_LOCKS. In releaseHeldLock, add the key to RELEASING_LOCKS before deleting from HELD_LOCKS, and remove it from RELEASING_LOCKS after the async fs.rm completes. In shouldTreatAsOrphanSelfLock, return false if RELEASING_LOCKS.has(normalizedSessionFile).

const RELEASING_LOCKS = new Set<string>();

// In releaseHeldLock, around line 149:
RELEASING_LOCKS.add(normalizedSessionFile);
HELD_LOCKS.delete(normalizedSessionFile);
held.releasePromise = (async () => {
  try { await held.handle.close(); } catch {}
  try { await fs.rm(held.lockPath, { force: true }); } catch {}
  finally { RELEASING_LOCKS.delete(normalizedSessionFile); }
})();

// In shouldTreatAsOrphanSelfLock:
if (RELEASING_LOCKS.has(params.normalizedSessionFile)) {
  return false;
}

Option B — move delete after async IO (simpler but changes semantics):

Move HELD_LOCKS.delete(normalizedSessionFile) to after the fs.rm completes (inside the async IIFE, after line 157). This keeps the entry visible in HELD_LOCKS during the entire cleanup, so shouldTreatAsOrphanSelfLock returns false and the acquirer waits/retries normally.
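
Concretely, Option B would look roughly like this (same caveats as the sketches above: the surrounding structure is assumed):

// In releaseHeldLock:
held.releasePromise = (async () => {
  try { await held.handle.close(); } catch {}
  try { await fs.rm(held.lockPath, { force: true }); } catch {}
  finally { HELD_LOCKS.delete(normalizedSessionFile); }   // the lock stops being "held" only once cleanup finishes
})();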

The downside is that HELD_LOCKS.size remains non-zero during cleanup, which keeps the watchdog running slightly longer, but this is harmless.

Option B is simpler and more correct — the lock should be considered "held" until cleanup actually finishes, not just until cleanup starts.
