Skip to content

fix(agents): release session write lock if fence read throws on prompt release#89649

Closed
spencer2211 wants to merge 1 commit into
openclaw:mainfrom
spencer2211:fix/session-write-lock-release-on-fence-error
Closed

fix(agents): release session write lock if fence read throws on prompt release#89649
spencer2211 wants to merge 1 commit into
openclaw:mainfrom
spencer2211:fix/session-write-lock-release-on-fence-error

Conversation

@spencer2211

@spencer2211 spencer2211 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Problem

On a long-running gateway, a single agent attempt can orphan the per-session write lock on the live process, wedging a channel for the full maxHoldMs lease (~17 min). Every interactive turn during that window fails with SessionWriteLockTimeoutError.

Observed in production: a Discord channel produced no replies from 00:08 until ~00:27, when the lock's lease expired and the watchdog reclaimed it. The lock payload's pid/starttime matched the live gateway process, so the stale-lock breaker correctly refused to reclaim it (it's a live holder, not a crash orphan) — only the lease TTL freed it.

Root cause

releaseHeldLockWithFence (the embedded-attempt session-lock controller's release-for-prompt path, src/agents/embedded-agent-runner/run/attempt.session-lock.ts) does:

const lock = heldLock;
heldLock = undefined;                                       // controller forgets it holds the lock
const fingerprint = await readSessionFileFingerprint(...);  // fs I/O — can throw
...
fenceSnapshot = await readSessionFileFenceSnapshot(...);    // fs I/O — can throw
fenceActive = true;
await lock.release();                                       // only reached if no throw above

release() is not in a finally. If either fence read rejects with a transient non-ENOENT error (e.g. EIO/EMFILE under load) after heldLock is cleared but before lock.release(), the function exits with the in-memory reference nulled and the underlying file lock never released.

releaseForPrompt() runs on every prompt submission (to free the lock for model I/O), so this is on a hot path. Because the controller thinks it no longer holds the lock, dispose() at attempt teardown sees heldLock === undefined and does nothing — yet the teardown still releases the in-process session-file owner mutex. The next attempt sails past the mutex and blocks on the orphaned file lock, failing every turn until the watchdog reclaims it at maxHoldMs.

Fix

Wrap the fence bookkeeping in try/finally so the file lock is always released once heldLock is cleared, even if the fence reads throw. The original error still propagates — only the lock-release invariant is made exception-safe.

Testing

  • New regression test: a fence-read failure (fs.stat rejecting with EIO) during releaseForPrompt must still release the underlying lock. Fails before the fix (release called 0 times), passes after.
  • attempt.session-lock.test.ts: 43 pass.
  • session-write-lock.test.ts: 53 pass.
  • Full agents-embedded-agent suite: 1301 pass (79 files), no regressions.
  • oxlint + oxfmt clean on changed files.

Real behavior proof

Beyond the unit test, I exercised the real lock stack end-to-end — the real acquireSessionWriteLock, the real @openclaw/fs-safe file-lock manager, and a real on-disk .jsonl.lock — with an end-to-end harness that:

  1. starts an attempt (controller eagerly acquires the real session write lock),
  2. injects a single transient non-ENOENT fs.stat failure (EIO) into the fence read during releaseForPrompt — the production failure mode under fd/I/O pressure,
  3. then has the next interactive turn attempt to acquire the same session lock.

Lock options mirror production (maxHoldMs: 1_020_000 — the ~17 min lease derived from the agent timeout; timeoutMs: 2_000 for the acquire).

Before the fix (current main) — the lock is orphaned on the live process and the next turn wedges:

[proof] after acquire: lock file on disk = true
[proof] releaseForPrompt threw (expected): EIO
[proof] after release: lock file on disk = true          ← lock leaked
[proof] next interactive acquire: FAILED after 2002ms: SessionWriteLockTimeoutError

After the fix — the lock is released even though the fence read threw, and the next turn proceeds immediately:

[proof] after acquire: lock file on disk = true
[proof] releaseForPrompt threw (expected): EIO
[proof] after release: lock file on disk = false         ← lock released
[proof] next interactive acquire: SUCCEEDED in 1ms

This reproduces the exact SessionWriteLockTimeoutError wedge on main and demonstrates the fix eliminates it — the same symptom seen in production (a Discord channel dead until the maxHoldMs lease expired). The harness was throwaway scaffolding (kept out of the diff to preserve its narrowness); the committed regression test covers the same release-invariant at unit scope.

Note: the fs.stat EIO is deliberate fault injection — the bug is a transient-error race that doesn't reproduce on demand in a healthy environment, so the proof induces the documented failure condition against the real lock stack rather than waiting for organic I/O pressure.

…t release

releaseHeldLockWithFence (the embedded-attempt session-lock controller's
release-for-prompt path) cleared its in-memory `heldLock` reference and then
performed fence bookkeeping — readSessionFileFingerprint and
readSessionFileFenceSnapshot — before calling lock.release(). Both perform
filesystem I/O that can reject with a transient non-ENOENT error (e.g.
EIO/EMFILE under load). Because release() was not in a finally, such a throw
exited the function with `heldLock` already nulled and the underlying file
lock never released.

The lock then leaked on the live gateway process for the full maxHoldMs lease.
Since the controller holds the lock with maxHoldMs derived from the agent
timeout (~17 min) and the holder is the live pid (its /proc starttime matches
the lock payload), the stale-lock breaker correctly refuses to reclaim it — only
the watchdog TTL frees it. Meanwhile the attempt teardown still releases the
in-process session-file owner mutex, so the next interactive turn sails past the
mutex and blocks on the orphaned file lock, failing every turn with
SessionWriteLockTimeoutError for up to 17 minutes (observed: a Discord channel
dead 00:08–00:27).

Wrap the fence bookkeeping in try/finally so the file lock is always released
once heldLock is cleared, even if the fence reads throw. The original error
still propagates; only the lock-release invariant is made exception-safe.

Add regression coverage: a fence-read failure during releaseForPrompt must still
release the underlying lock.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 3, 2026
@clawsweeper

clawsweeper Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 2, 2026, 10:12 PM ET / 02:12 UTC.

Summary
The PR wraps embedded attempt prompt-release fence bookkeeping in try/finally so the captured session write lock is released when fence reads throw, and adds a regression test.

PR surface: Source +10, Tests +25. Total +35 across 2 files.

Reproducibility: yes. Source inspection shows current main can clear heldLock before non-ENOENT fence-read errors throw, and the PR body provides before/after live-output fault injection for that current-main path.

Review metrics: none identified.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Next step before merge

  • [P2] No repair lane is needed because there are no actionable patch findings; maintainers need ordinary review and required-check handling.

Security
Cleared: The diff only changes session-lock release control flow and a colocated test; it does not add dependency, CI, secret, package, or code-execution surface.

Review details

Best possible solution:

Land the narrow try/finally fix with its regression test after normal required checks and maintainer review pass.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows current main can clear heldLock before non-ENOENT fence-read errors throw, and the PR body provides before/after live-output fault injection for that current-main path.

Is this the best way to solve the issue?

Yes. Releasing the captured lock in a finally at the exact point heldLock is cleared is the narrowest maintainable fix; changing the watchdog or lock manager would leave this controller invariant broken.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f2a46b066120.

Label changes

Label justifications:

  • P1: The PR fixes a live agent/channel availability failure where a leaked session write lock can block every interactive turn until the long maxHoldMs lease expires.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes copied before/after live output from a real lock-stack harness showing current main wedging on injected EIO and the patch releasing the lock so the next acquire succeeds.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes copied before/after live output from a real lock-stack harness showing current main wedging on injected EIO and the patch releasing the lock so the next acquire succeeds.
Evidence reviewed

PR surface:

Source +10, Tests +25. Total +35 across 2 files.

View PR surface stats
Area Files Added Removed Net
Source 1 21 11 +10
Tests 1 25 0 +25
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 2 46 11 +35

What I checked:

Likely related people:

  • vincentkoc: git blame and git show tie the current embedded attempt session-lock controller, tests, and releaseHeldLockWithFence ordering to commit 984c3de by Vincent Koc. (role: introduced current behavior / recent area contributor; confidence: high; commits: 984c3ded9a34, f00f0a95968f; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts, src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts, src/agents/session-write-lock.ts)
  • steipete: git log shows prior session write-lock hardening and release-path refactoring in the adjacent session lock module, including commit f4782e1. (role: adjacent session-lock contributor; confidence: medium; commits: f4782e1e730e, 0cf533ac6157, fb6e415d0c52; files: src/agents/session-write-lock.ts, src/agents/embedded-agent-runner/run/attempt.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. labels Jun 3, 2026
@spencer2211

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review — added real-behavior proof to the PR body: an end-to-end harness against the real lock stack (real acquireSessionWriteLock + @openclaw/fs-safe manager + on-disk .jsonl.lock) that injects the transient EIO fence-read failure and shows the next interactive acquire wedges with SessionWriteLockTimeoutError on current main (FAILED after 2002ms) but SUCCEEDS in 1ms with the fix (lock released instead of orphaned).

@clawsweeper

clawsweeper Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 3, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label Jun 3, 2026
@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

@clawsweeper

clawsweeper Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

🦞🔧
ClawSweeper automerge is enabled.

Draft PRs stay fix-only until GitHub marks them ready for review. Pause with /clawsweeper stop.

Automerge progress:

  • 2026-06-03 11:33:28 UTC review queued 1d91f1acb810 (queued)

@clawsweeper clawsweeper Bot added the clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge label Jun 3, 2026
@clawsweeper

clawsweeper Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper 🐠 reef update

Thanks for the work here. ClawSweeper could not write to the source branch, so it opened a replacement PR rather than letting the fix drift. attribution still points back here.

Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
Replacement PR: #89811
Why close: this run explicitly closes the superseded source PR after the credited replacement PR is open, so review continues in one place.
This closeout is intentional for this run: the replacement PR is now the active review lane.
The original contribution stays credited in the replacement PR context.
Co-author credit kept:

fish notes: model gpt-5.5, reasoning high; reviewed against 394d978.

@clawsweeper clawsweeper Bot closed this Jun 3, 2026
@anyech

anyech commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Adding a fresh same-family production observation that looks like this PR's failure mode, especially the "wedged for the full maxHold lease" part.

Observed shape, redacted:

user follow-up arrives in the same Discord thread
agent turn starts and reaches memory_search
stalled session diagnostics report activeTool=memory_search for several minutes
stuck session recovery aborts the embedded run
embedded abort settle timed out
later user follow-ups do enter the lane, but fail before reply with SessionWriteLockTimeoutError
lock file pid/starttime match the still-running gateway process
lock maxHoldMs is 1020000ms (~17m)

Representative redacted log sequence:

stalled session ... reason=blocked_tool_call activeTool=memory_search activeToolAge=388s
stuck session recovery ... action=abort_embedded_run aborted=true drained=true
embedded abort settle timed out ... timeoutMs=2000
lane task error ... SessionWriteLockTimeoutError: session file locked (timeout 60000ms): pid=<live-gateway-pid> .../<session>.jsonl.lock
Embedded agent failed before reply: session file locked (timeout 60000ms): pid=<live-gateway-pid> .../<session>.jsonl.lock

Important nuance: in this observation, later follow-up turns were already failing before the lock's own maxHoldMs had elapsed. So reclaim-after-maxHold PRs such as #89673 are still useful as a safety net, but they do not prevent the user-visible 17-minute dead window. That makes the prompt-release / abort-release leak class addressed here still important.

This also explains why the issue presents as a Discord thread "not responding" even though Discord inbound is working: the later messages are accepted and routed, but session write fails before a reply can be recorded or delivered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. size: S status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants