fix(agents): release session write lock if fence read throws on prompt release#89649
fix(agents): release session write lock if fence read throws on prompt release#89649spencer2211 wants to merge 1 commit into
Conversation
…t release releaseHeldLockWithFence (the embedded-attempt session-lock controller's release-for-prompt path) cleared its in-memory `heldLock` reference and then performed fence bookkeeping — readSessionFileFingerprint and readSessionFileFenceSnapshot — before calling lock.release(). Both perform filesystem I/O that can reject with a transient non-ENOENT error (e.g. EIO/EMFILE under load). Because release() was not in a finally, such a throw exited the function with `heldLock` already nulled and the underlying file lock never released. The lock then leaked on the live gateway process for the full maxHoldMs lease. Since the controller holds the lock with maxHoldMs derived from the agent timeout (~17 min) and the holder is the live pid (its /proc starttime matches the lock payload), the stale-lock breaker correctly refuses to reclaim it — only the watchdog TTL frees it. Meanwhile the attempt teardown still releases the in-process session-file owner mutex, so the next interactive turn sails past the mutex and blocks on the orphaned file lock, failing every turn with SessionWriteLockTimeoutError for up to 17 minutes (observed: a Discord channel dead 00:08–00:27). Wrap the fence bookkeeping in try/finally so the file lock is always released once heldLock is cleared, even if the fence reads throw. The original error still propagates; only the lock-release invariant is made exception-safe. Add regression coverage: a fence-read failure during releaseForPrompt must still release the underlying lock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 10:12 PM ET / 02:12 UTC. Summary PR surface: Source +10, Tests +25. Total +35 across 2 files. Reproducibility: yes. Source inspection shows current main can clear heldLock before non-ENOENT fence-read errors throw, and the PR body provides before/after live-output fault injection for that current-main path. Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Next step before merge
Security Review detailsBest possible solution: Land the narrow try/finally fix with its regression test after normal required checks and maintainer review pass. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection shows current main can clear heldLock before non-ENOENT fence-read errors throw, and the PR body provides before/after live-output fault injection for that current-main path. Is this the best way to solve the issue? Yes. Releasing the captured lock in a finally at the exact point heldLock is cleared is the narrowest maintainable fix; changing the watchdog or lock manager would leave this controller invariant broken. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against f2a46b066120. Label changesLabel justifications:
Evidence reviewedPR surface: Source +10, Tests +25. Total +35 across 2 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
@clawsweeper re-review — added real-behavior proof to the PR body: an end-to-end harness against the real lock stack (real |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper automerge |
|
🦞🔧
Draft PRs stay fix-only until GitHub marks them ready for review. Pause with Automerge progress:
|
|
ClawSweeper 🐠 reef update Thanks for the work here. ClawSweeper could not write to the source branch, so it opened a replacement PR rather than letting the fix drift. attribution still points back here. Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
fish notes: model gpt-5.5, reasoning high; reviewed against 394d978. |
|
Adding a fresh same-family production observation that looks like this PR's failure mode, especially the "wedged for the full maxHold lease" part. Observed shape, redacted: Representative redacted log sequence: Important nuance: in this observation, later follow-up turns were already failing before the lock's own This also explains why the issue presents as a Discord thread "not responding" even though Discord inbound is working: the later messages are accepted and routed, but session write fails before a reply can be recorded or delivered. |
Problem
On a long-running gateway, a single agent attempt can orphan the per-session write lock on the live process, wedging a channel for the full
maxHoldMslease (~17 min). Every interactive turn during that window fails withSessionWriteLockTimeoutError.Observed in production: a Discord channel produced no replies from 00:08 until ~00:27, when the lock's lease expired and the watchdog reclaimed it. The lock payload's
pid/starttimematched the live gateway process, so the stale-lock breaker correctly refused to reclaim it (it's a live holder, not a crash orphan) — only the lease TTL freed it.Root cause
releaseHeldLockWithFence(the embedded-attempt session-lock controller's release-for-prompt path,src/agents/embedded-agent-runner/run/attempt.session-lock.ts) does:release()is not in afinally. If either fence read rejects with a transient non-ENOENT error (e.g.EIO/EMFILEunder load) afterheldLockis cleared but beforelock.release(), the function exits with the in-memory reference nulled and the underlying file lock never released.releaseForPrompt()runs on every prompt submission (to free the lock for model I/O), so this is on a hot path. Because the controller thinks it no longer holds the lock,dispose()at attempt teardown seesheldLock === undefinedand does nothing — yet the teardown still releases the in-process session-file owner mutex. The next attempt sails past the mutex and blocks on the orphaned file lock, failing every turn until the watchdog reclaims it atmaxHoldMs.Fix
Wrap the fence bookkeeping in
try/finallyso the file lock is always released onceheldLockis cleared, even if the fence reads throw. The original error still propagates — only the lock-release invariant is made exception-safe.Testing
fs.statrejecting withEIO) duringreleaseForPromptmust still release the underlying lock. Fails before the fix (release called 0 times), passes after.attempt.session-lock.test.ts: 43 pass.session-write-lock.test.ts: 53 pass.agents-embedded-agentsuite: 1301 pass (79 files), no regressions.oxlint+oxfmtclean on changed files.Real behavior proof
Beyond the unit test, I exercised the real lock stack end-to-end — the real
acquireSessionWriteLock, the real@openclaw/fs-safefile-lock manager, and a real on-disk.jsonl.lock— with an end-to-end harness that:fs.statfailure (EIO) into the fence read duringreleaseForPrompt— the production failure mode under fd/I/O pressure,Lock options mirror production (
maxHoldMs: 1_020_000— the ~17 min lease derived from the agent timeout;timeoutMs: 2_000for the acquire).Before the fix (current
main) — the lock is orphaned on the live process and the next turn wedges:After the fix — the lock is released even though the fence read threw, and the next turn proceeds immediately:
This reproduces the exact
SessionWriteLockTimeoutErrorwedge onmainand demonstrates the fix eliminates it — the same symptom seen in production (a Discord channel dead until themaxHoldMslease expired). The harness was throwaway scaffolding (kept out of the diff to preserve its narrowness); the committed regression test covers the same release-invariant at unit scope.