fix(agents): release session write lock if fence read throws on prompt release by spencer2211 · Pull Request #89649 · openclaw/openclaw

spencer2211 · 2026-06-03T01:13:59Z

Problem

On a long-running gateway, a single agent attempt can orphan the per-session write lock on the live process, wedging a channel for the full maxHoldMs lease (~17 min). Every interactive turn during that window fails with SessionWriteLockTimeoutError.

Observed in production: a Discord channel produced no replies from 00:08 until ~00:27, when the lock's lease expired and the watchdog reclaimed it. The lock payload's pid/starttime matched the live gateway process, so the stale-lock breaker correctly refused to reclaim it (it's a live holder, not a crash orphan) — only the lease TTL freed it.

Root cause

releaseHeldLockWithFence (the embedded-attempt session-lock controller's release-for-prompt path, src/agents/embedded-agent-runner/run/attempt.session-lock.ts) does:

const lock = heldLock;
heldLock = undefined;                                       // controller forgets it holds the lock
const fingerprint = await readSessionFileFingerprint(...);  // fs I/O — can throw
...
fenceSnapshot = await readSessionFileFenceSnapshot(...);    // fs I/O — can throw
fenceActive = true;
await lock.release();                                       // only reached if no throw above

release() is not in a finally. If either fence read rejects with a transient non-ENOENT error (e.g. EIO/EMFILE under load) after heldLock is cleared but before lock.release(), the function exits with the in-memory reference nulled and the underlying file lock never released.

releaseForPrompt() runs on every prompt submission (to free the lock for model I/O), so this is on a hot path. Because the controller thinks it no longer holds the lock, dispose() at attempt teardown sees heldLock === undefined and does nothing — yet the teardown still releases the in-process session-file owner mutex. The next attempt sails past the mutex and blocks on the orphaned file lock, failing every turn until the watchdog reclaims it at maxHoldMs.

Fix

Wrap the fence bookkeeping in try/finally so the file lock is always released once heldLock is cleared, even if the fence reads throw. The original error still propagates — only the lock-release invariant is made exception-safe.

Testing

New regression test: a fence-read failure (fs.stat rejecting with EIO) during releaseForPrompt must still release the underlying lock. Fails before the fix (release called 0 times), passes after.
attempt.session-lock.test.ts: 43 pass.
session-write-lock.test.ts: 53 pass.
Full agents-embedded-agent suite: 1301 pass (79 files), no regressions.
oxlint + oxfmt clean on changed files.

Real behavior proof

Beyond the unit test, I exercised the real lock stack end-to-end — the real acquireSessionWriteLock, the real @openclaw/fs-safe file-lock manager, and a real on-disk .jsonl.lock — with an end-to-end harness that:

starts an attempt (controller eagerly acquires the real session write lock),
injects a single transient non-ENOENT fs.stat failure (EIO) into the fence read during releaseForPrompt — the production failure mode under fd/I/O pressure,
then has the next interactive turn attempt to acquire the same session lock.

Lock options mirror production (maxHoldMs: 1_020_000 — the ~17 min lease derived from the agent timeout; timeoutMs: 2_000 for the acquire).

Before the fix (current main) — the lock is orphaned on the live process and the next turn wedges:

[proof] after acquire: lock file on disk = true
[proof] releaseForPrompt threw (expected): EIO
[proof] after release: lock file on disk = true          ← lock leaked
[proof] next interactive acquire: FAILED after 2002ms: SessionWriteLockTimeoutError

After the fix — the lock is released even though the fence read threw, and the next turn proceeds immediately:

[proof] after acquire: lock file on disk = true
[proof] releaseForPrompt threw (expected): EIO
[proof] after release: lock file on disk = false         ← lock released
[proof] next interactive acquire: SUCCEEDED in 1ms

This reproduces the exact SessionWriteLockTimeoutError wedge on main and demonstrates the fix eliminates it — the same symptom seen in production (a Discord channel dead until the maxHoldMs lease expired). The harness was throwaway scaffolding (kept out of the diff to preserve its narrowness); the committed regression test covers the same release-invariant at unit scope.

Note: the fs.stat EIO is deliberate fault injection — the bug is a transient-error race that doesn't reproduce on demand in a healthy environment, so the proof induces the documented failure condition against the real lock stack rather than waiting for organic I/O pressure.

…t release releaseHeldLockWithFence (the embedded-attempt session-lock controller's release-for-prompt path) cleared its in-memory `heldLock` reference and then performed fence bookkeeping — readSessionFileFingerprint and readSessionFileFenceSnapshot — before calling lock.release(). Both perform filesystem I/O that can reject with a transient non-ENOENT error (e.g. EIO/EMFILE under load). Because release() was not in a finally, such a throw exited the function with `heldLock` already nulled and the underlying file lock never released. The lock then leaked on the live gateway process for the full maxHoldMs lease. Since the controller holds the lock with maxHoldMs derived from the agent timeout (~17 min) and the holder is the live pid (its /proc starttime matches the lock payload), the stale-lock breaker correctly refuses to reclaim it — only the watchdog TTL frees it. Meanwhile the attempt teardown still releases the in-process session-file owner mutex, so the next interactive turn sails past the mutex and blocks on the orphaned file lock, failing every turn with SessionWriteLockTimeoutError for up to 17 minutes (observed: a Discord channel dead 00:08–00:27). Wrap the fence bookkeeping in try/finally so the file lock is always released once heldLock is cleared, even if the fence reads throw. The original error still propagates; only the lock-release invariant is made exception-safe. Add regression coverage: a fence-read failure during releaseForPrompt must still release the underlying lock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clawsweeper · 2026-06-03T01:15:53Z

Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 10:12 PM ET / 02:12 UTC.

Summary
The PR wraps embedded attempt prompt-release fence bookkeeping in try/finally so the captured session write lock is released when fence reads throw, and adds a regression test.

PR surface: Source +10, Tests +25. Total +35 across 2 files.

Reproducibility: yes. Source inspection shows current main can clear heldLock before non-ENOENT fence-read errors throw, and the PR body provides before/after live-output fault injection for that current-main path.

Review metrics: none identified.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Next step before merge

[P2] No repair lane is needed because there are no actionable patch findings; maintainers need ordinary review and required-check handling.

Security
Cleared: The diff only changes session-lock release control flow and a colocated test; it does not add dependency, CI, secret, package, or code-execution surface.

Review details

Best possible solution:

Land the narrow try/finally fix with its regression test after normal required checks and maintainer review pass.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows current main can clear heldLock before non-ENOENT fence-read errors throw, and the PR body provides before/after live-output fault injection for that current-main path.

Is this the best way to solve the issue?

Yes. Releasing the captured lock in a finally at the exact point heldLock is cleared is the narrowest maintainable fix; changing the watchdog or lock manager would leave this controller invariant broken.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f2a46b066120.

Label changes

Label justifications:

P1: The PR fixes a live agent/channel availability failure where a leaked session write lock can block every interactive turn until the long maxHoldMs lease expires.
rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes copied before/after live output from a real lock-stack harness showing current main wedging on injected EIO and the patch releasing the lock so the next acquire succeeds.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied before/after live output from a real lock-stack harness showing current main wedging on injected EIO and the patch releasing the lock so the next acquire succeeds.

Evidence reviewed

PR surface:

Source +10, Tests +25. Total +35 across 2 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	1	21	11	+10
Tests	1	25	0	+25
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	2	46	11	+35

What I checked:

Repository policy read: Read the full root AGENTS.md and scoped agent runner AGENTS.md files; their deep-review, dependency-contract, hot-path, and focused-test guidance applied to this review. (AGENTS.md:1, f2a46b066120)
Current-main vulnerable ordering: Current main copies heldLock, clears it, then awaits fingerprint and fence-snapshot filesystem reads before calling lock.release, so a non-ENOENT read error exits before releasing the underlying lock. (src/agents/embedded-agent-runner/run/attempt.session-lock.ts:802, f2a46b066120)
Fence read can throw the reported error class: readSessionFileFingerprint converts only ENOENT to exists:false and rethrows other fs.stat failures, matching the PR's transient EIO/EMFILE failure path. (src/agents/embedded-agent-runner/run/attempt.session-lock.ts:524, f2a46b066120)
Prompt-release is on the interactive path: The embedded runner installs releaseForPrompt around provider prompt submission and also calls releaseForPrompt before later post-prompt waits, so this invariant affects real interactive turns. (src/agents/embedded-agent-runner/run/attempt.ts:3964, f2a46b066120)
Dependency contract checked: acquireSessionWriteLock returns the fs-safe release handle, and @openclaw/fs-safe 0.3.0 sidecar-lock removes the held entry and sidecar file only when that release handle runs. (src/agents/session-write-lock.ts:921, f2a46b066120)
PR patch shape: The PR puts fence bookkeeping inside try and awaits lock.release in finally after heldLock is cleared, which directly preserves the release invariant at the vulnerable point. (src/agents/embedded-agent-runner/run/attempt.session-lock.ts:803, 1d91f1acb810)

Likely related people:

vincentkoc: git blame and git show tie the current embedded attempt session-lock controller, tests, and releaseHeldLockWithFence ordering to commit 984c3de by Vincent Koc. (role: introduced current behavior / recent area contributor; confidence: high; commits: 984c3ded9a34, f00f0a95968f; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts, src/agents/session-write-lock.ts)
steipete: git log shows prior session write-lock hardening and release-path refactoring in the adjacent session lock module, including commit f4782e1. (role: adjacent session-lock contributor; confidence: medium; commits: f4782e1e730e, 0cf533ac6157, fb6e415d0c52; files: src/agents/session-write-lock.ts, src/agents/embedded-agent-runner/run/attempt.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

spencer2211 · 2026-06-03T01:54:59Z

@clawsweeper re-review — added real-behavior proof to the PR body: an end-to-end harness against the real lock stack (real acquireSessionWriteLock + @openclaw/fs-safe manager + on-disk .jsonl.lock) that injects the transient EIO fence-read failure and shows the next interactive acquire wedges with SessionWriteLockTimeoutError on current main (FAILED after 2002ms) but SUCCEEDS in 1ms with the fix (lock released instead of orphaned).

clawsweeper · 2026-06-03T01:55:02Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/26859022550
Updated: 2026-06-03T02:02:25.283Z

Takhoffman · 2026-06-03T11:33:28Z

@clawsweeper automerge

clawsweeper · 2026-06-03T11:33:32Z

🦞🔧
ClawSweeper automerge is enabled.

Head: 1d91f1acb810
Label: clawsweeper:automerge
Action: repair worker queued. Run: https://github.com/openclaw/clawsweeper/actions/runs/26882058280
Flow: review this head, repair/rebase only if needed, then re-review the exact repaired head before merge.

Draft PRs stay fix-only until GitHub marks them ready for review. Pause with /clawsweeper stop.

Automerge progress:

2026-06-03 11:33:28 UTC review queued 1d91f1acb810 (queued)

2026-06-03 11:34:29 UTC repair queued 1d91f1acb810 (autonomous) Run: https://github.com/openclaw/clawsweeper/actions/runs/26882058280

2026-06-03 11:35:55 UTC repair started (running) in 0s Run: https://github.com/openclaw/clawsweeper/actions/runs/26882058280 automerge-openclaw-openclaw-89649

2026-06-03 11:36:10 UTC validation plan (passed) in 16s Run: https://github.com/openclaw/clawsweeper/actions/runs/26882058280 pnpm check:changed; pnpm lint; pnpm check:test-types

2026-06-03 11:36:24 UTC Codex write preflight (passed) in 29s Run: https://github.com/openclaw/clawsweeper/actions/runs/26882058280 danger-full-access

2026-06-03 11:40:48 UTC Codex edit 1 d0b2b4ba3bc2 (complete) in 4m 54s Run: https://github.com/openclaw/clawsweeper/actions/runs/26882058280 exit 0

2026-06-03 11:43:57 UTC validation and review 1 394d9784371e (complete) in 8m 3s Run: https://github.com/openclaw/clawsweeper/actions/runs/26882058280 already-current

2026-06-03 11:44:35 UTC repair finished 394d9784371e (opened) in 8m 40s Run: https://github.com/openclaw/clawsweeper/actions/runs/26882058280 open_fix_pr

clawsweeper · 2026-06-03T11:44:32Z

ClawSweeper 🐠 reef update

Thanks for the work here. ClawSweeper could not write to the source branch, so it opened a replacement PR rather than letting the fix drift. attribution still points back here.

Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
Replacement PR: #89811
Why close: this run explicitly closes the superseded source PR after the credited replacement PR is open, so review continues in one place.
This closeout is intentional for this run: the replacement PR is now the active review lane.
The original contribution stays credited in the replacement PR context.
Co-author credit kept:

@spencer2211: Co-authored-by: Spencer Fuller 28957500+spencer2211@users.noreply.github.com

fish notes: model gpt-5.5, reasoning high; reviewed against 394d978.

anyech · 2026-06-03T20:14:49Z

Adding a fresh same-family production observation that looks like this PR's failure mode, especially the "wedged for the full maxHold lease" part.

Observed shape, redacted:

user follow-up arrives in the same Discord thread
agent turn starts and reaches memory_search
stalled session diagnostics report activeTool=memory_search for several minutes
stuck session recovery aborts the embedded run
embedded abort settle timed out
later user follow-ups do enter the lane, but fail before reply with SessionWriteLockTimeoutError
lock file pid/starttime match the still-running gateway process
lock maxHoldMs is 1020000ms (~17m)

Representative redacted log sequence:

stalled session ... reason=blocked_tool_call activeTool=memory_search activeToolAge=388s
stuck session recovery ... action=abort_embedded_run aborted=true drained=true
embedded abort settle timed out ... timeoutMs=2000
lane task error ... SessionWriteLockTimeoutError: session file locked (timeout 60000ms): pid=<live-gateway-pid> .../<session>.jsonl.lock
Embedded agent failed before reply: session file locked (timeout 60000ms): pid=<live-gateway-pid> .../<session>.jsonl.lock

Important nuance: in this observation, later follow-up turns were already failing before the lock's own maxHoldMs had elapsed. So reclaim-after-maxHold PRs such as #89673 are still useful as a safety net, but they do not prevent the user-visible 17-minute dead window. That makes the prompt-release / abort-release leak class addressed here still important.

This also explains why the issue presents as a Discord thread "not responding" even though Discord inbound is working: the later messages are accepted and routed, but session write fails before a reply can be recorded or delivered.

openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 3, 2026

openclaw-barnacle Bot removed the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label Jun 3, 2026

clawsweeper Bot added the clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge label Jun 3, 2026

clawsweeper Bot mentioned this pull request Jun 3, 2026

fix(agents): release session write lock if fence read throws on prompt release #89811

Merged

clawsweeper Bot closed this Jun 3, 2026

anyech mentioned this pull request Jun 3, 2026

fix(agents): reclaim session write-locks held past the holder's own maxHoldMs #89673

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): release session write lock if fence read throws on prompt release#89649

fix(agents): release session write lock if fence read throws on prompt release#89649
spencer2211 wants to merge 1 commit into
openclaw:mainfrom
spencer2211:fix/session-write-lock-release-on-fence-error

spencer2211 commented Jun 3, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

spencer2211 commented Jun 3, 2026

Uh oh!

clawsweeper Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

Takhoffman commented Jun 3, 2026

Uh oh!

clawsweeper Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented Jun 3, 2026

Uh oh!

anyech commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

spencer2211 commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Testing

Real behavior proof

Uh oh!

clawsweeper Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

spencer2211 commented Jun 3, 2026

Uh oh!

clawsweeper Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Takhoffman commented Jun 3, 2026

Uh oh!

clawsweeper Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented Jun 3, 2026

Uh oh!

anyech commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

spencer2211 commented Jun 3, 2026 •

edited

Loading

clawsweeper Bot commented Jun 3, 2026 •

edited

Loading

clawsweeper Bot commented Jun 3, 2026 •

edited

Loading

clawsweeper Bot commented Jun 3, 2026 •

edited

Loading