Summary
There is a race condition in src/agents/session-write-lock.ts during the final release() path. The in-process re-entrancy map entry is deleted before async cleanup (closing the file handle + removing the lock file).
Bug details
On the final release, the code currently does:
1. HELD_LOCKS.delete(sessionKey)
2. await handle.close()
3. await fs.rm(lockPath)
During the async gap after step (1), a concurrent acquire in the same Node process sees no HELD_LOCKS entry but the lock file still exists on disk, so it falls back to the filesystem retry loop and can spin until the 10s acquire timeout.
This is especially likely if handle.close() is slow or rejects. Today a close() rejection prevents fs.rm() from running at all, which leaves a persistent lock file on disk while the pid is still alive.
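The window can be illustrated with a small in-memory simulation. This is only a sketch of the ordering described above, not the actual code in session-write-lock.ts: the names heldLocks, filesOnDisk, buggyRelease, observe and demo are hypothetical, a Set stands in for lock files on disk, and a single microtask yield stands in for handle.close().

```typescript
// In-memory stand-ins (assumptions, not the real module state):
// a Set plays the role of lock files on disk.
const filesOnDisk = new Set<string>();
const heldLocks = new Map<string, { lockPath: string }>();

// One microtask yield, standing in for an async close().
const tick = (): Promise<void> => Promise.resolve();

// Mirrors the ordering reported above: map entry removed first,
// async cleanup second.
async function buggyRelease(sessionKey: string): Promise<void> {
  const held = heldLocks.get(sessionKey);
  if (!held) return;
  heldLocks.delete(sessionKey);      // step 1
  await tick();                      // step 2: stands in for handle.close()
  filesOnDisk.delete(held.lockPath); // step 3: stands in for fs.rm(lockPath)
}

// What a concurrent acquire in the same process would observe right now.
function observe(sessionKey: string, lockPath: string) {
  return {
    heldInProcess: heldLocks.has(sessionKey),
    lockFileExists: filesOnDisk.has(lockPath),
  };
}

async function demo() {
  const key = "session-a";
  const lockPath = "/tmp/session-a.lock";
  filesOnDisk.add(lockPath);
  heldLocks.set(key, { lockPath });

  const releasing = buggyRelease(key);     // runs synchronously up to the first await
  const duringGap = observe(key, lockPath); // taken inside the async gap
  await releasing;
  const afterRelease = observe(key, lockPath);
  return { duringGap, afterRelease };
}
```

In this simulation, duringGap reports heldInProcess as false while lockFileExists is still true, which is the state that sends a concurrent acquire into the filesystem retry loop.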
How to reproduce
Run high-concurrency work that frequently touches the same session file from multiple tasks in the same process (e.g. multiple crons + subagents + heartbeat), so that releases and acquires interleave.
Our config that triggers this intermittently:
- maxConcurrent: 4
- subagents.maxConcurrent: 8
- 6 cron jobs
Impact
Intermittent 10s lock acquisition timeouts (seen as FailoverError / agent failures), which cascade into failed runs and missed cron work.
Proposed fix
Add an in-memory releasing promise state to the held lock entry. On final release, set held.releasing before any awaits and have acquires that observe a releasing state await it (instead of spinning on the filesystem lock file). Also ensure fs.rm() runs even if handle.close() fails (close wrapped in catch / finally).
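A minimal sketch of what this could look like, under stated assumptions: the HeldLock shape, the Closable interface, the injected removeFile callback and the waitIfReleasing helper are all hypothetical and chosen so the example runs without touching the filesystem. The real entry type and acquire path in session-write-lock.ts may differ.

```typescript
// Hypothetical shapes; the real types in session-write-lock.ts may differ.
interface Closable {
  close(): Promise<void>; // stands in for the lock file's handle
}

interface HeldLock {
  count: number;             // re-entrancy depth
  handle: Closable;
  lockPath: string;
  releasing?: Promise<void>; // set synchronously on the final release
}

const HELD_LOCKS = new Map<string, HeldLock>();

async function release(
  sessionKey: string,
  // Injected so the sketch is testable; in practice this might wrap fs.rm.
  removeFile: (path: string) => Promise<void>,
): Promise<void> {
  const held = HELD_LOCKS.get(sessionKey);
  if (!held) return;
  if (held.releasing) return held.releasing; // a final release is already in flight
  held.count -= 1;
  if (held.count > 0) return;                // still held re-entrantly

  // The async function below runs synchronously up to its first await,
  // so held.releasing is assigned before any other task can run.
  held.releasing = (async () => {
    try {
      await held.handle.close();
    } catch {
      // A close() failure must not prevent lock file removal.
    } finally {
      try {
        await removeFile(held.lockPath);
      } finally {
        // Drop the map entry only after cleanup has been attempted.
        if (HELD_LOCKS.get(sessionKey) === held) HELD_LOCKS.delete(sessionKey);
      }
    }
  })();
  return held.releasing;
}

// Called at the start of acquire: wait for an in-flight release
// instead of spinning on the lock file.
async function waitIfReleasing(sessionKey: string): Promise<void> {
  const held = HELD_LOCKS.get(sessionKey);
  if (held?.releasing) await held.releasing.catch(() => undefined);
}
```

With this ordering the map entry stays visible for the whole cleanup, so a same-process acquire always sees either a held lock, a releasing promise it can await, or no entry and no lock file. Whether a removeFile failure should propagate to the releasing caller, as it does here, is a design choice left open.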