fix(agents): bound abort-path session lock release; force-release on unsettled retained writes#6
Conversation
…unsettled retained writes A turn timeout aborts the run, but releaseHeldLockForAbort delegated to the graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained transcript write pinned behind a hung provider stream (one that ignores abort, e.g. a stalled streamGenerateContent) therefore held the session .jsonl lock until the maxHoldMs watchdog (5-30 min), and every turn in between failed with SessionWriteLockTimeoutError. releaseHeldLockForAbort now races the graceful path against the abort settle bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS). When the bound expires the underlying file lock is force-released and the controller poisoned via takeoverDetected, so the aborted run cannot perform torn writes after losing ownership. Retained writes that settle within the bound keep the existing graceful fence semantics. Prod incident: tulgey#225 (membrane 2026-06-01..03). Completes the lock-leak family of openclaw#87278 / openclaw#88623 / openclaw#89811, which release on settled aborts but not on aborts a hung provider call never lets settle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR adds timeout-bounded semantics to embedded attempt session lock abort-release paths. The controller now accepts an optional ChangesBounded timeout for abort-release lock semantics
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts (1)
166-231: ⚡ Quick winAdd coverage for the timeout branch where another release already owns the drain.
These two cases only exercise the timeout when
releaseHeldLockForAbort()becomes the drain owner first. They never hit thedrainOwner === "abort-bound"path, so a stuckreleaseForPrompt()/dispose()can leaveheldLockDrainingwedged without this suite noticing. Please add a focused case that starts a graceful release first, then times out the abort path, and assertsdispose()orreacquireAfterPrompt()still resolve.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts` around lines 166 - 231, Add a new unit test that exercises the timeout branch where the graceful release becomes the drain owner before the abort path: create a controller via createEmbeddedAttemptSessionLockController, start a graceful release path (call whatever triggers the prompt release/releaseForPrompt or dispose path by invoking withSessionWriteLock then calling controller.releaseForPrompt()/dispose()), then concurrently call controller.releaseHeldLockForAbort() with a very small abortReleaseTimeoutMs so the abort path times out and obtains drainOwner === "abort-bound"; assert that the abort path times out (i.e., releaseHeldLockForAbort resolves) and that subsequent calls like controller.dispose() or controller.reacquireAfterPrompt() still resolve (no wedged heldLockDraining), and verify release was called the expected number of times. Ensure the test references releaseHeldLockForAbort, releaseForPrompt/dispose (whichever exists), withSessionWriteLock, and reacquireAfterPrompt to locate the implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/agents/embedded-agent-runner/run/attempt.session-lock.ts`:
- Around line 872-880: When the timeout path detects drainOwner ===
"abort-bound" it must revoke the drain immediately instead of waiting for
drainAcquisition to resolve; modify the branch in the raceWithBound handler
(around beginHeldLockDrain, drainAcquisition, drainOwner) to clear/reset the
heldLockDraining state right away (so
waitForHeldLockDrain/waitForRetainedLockIdle won't hang) and ensure the pending
drainAcquisition is detached (e.g., keep the void drainAcquisition.then(...) but
do not rely on its resolution to clear heldLockDraining), then call
forceRelease() and return; update finishHeldLockDrain usage so the orphaned
promise becomes a no-op and does not re-set heldLockDraining when it later
resolves.
---
Nitpick comments:
In `@src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts`:
- Around line 166-231: Add a new unit test that exercises the timeout branch
where the graceful release becomes the drain owner before the abort path: create
a controller via createEmbeddedAttemptSessionLockController, start a graceful
release path (call whatever triggers the prompt release/releaseForPrompt or
dispose path by invoking withSessionWriteLock then calling
controller.releaseForPrompt()/dispose()), then concurrently call
controller.releaseHeldLockForAbort() with a very small abortReleaseTimeoutMs so
the abort path times out and obtains drainOwner === "abort-bound"; assert that
the abort path times out (i.e., releaseHeldLockForAbort resolves) and that
subsequent calls like controller.dispose() or controller.reacquireAfterPrompt()
still resolve (no wedged heldLockDraining), and verify release was called the
expected number of times. Ensure the test references releaseHeldLockForAbort,
releaseForPrompt/dispose (whichever exists), withSessionWriteLock, and
reacquireAfterPrompt to locate the implementation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c3b5f49f-9adb-486d-921b-dc5ff5cdbe00
📒 Files selected for processing (2)
src/agents/embedded-agent-runner/run/attempt.session-lock.test.tssrc/agents/embedded-agent-runner/run/attempt.session-lock.ts
| const drainAcquisition = beginHeldLockDrain(); | ||
| const drainOwner = await raceWithBound(drainAcquisition); | ||
| if (drainOwner === "abort-bound") { | ||
| // Another release path owns the drain and is itself stuck. Take release | ||
| // ownership directly (clearing `heldLock` makes the stuck path no-op when | ||
| // it resumes) and hand the eventually-acquired drain straight back. | ||
| void drainAcquisition.then((owner) => finishHeldLockDrain(owner)); | ||
| await forceRelease(); | ||
| return; |
There was a problem hiding this comment.
Timeout force-release can still wedge the controller behind heldLockDraining.
If releaseForPrompt() or dispose() already owns the drain and is hung in waitForRetainedLockIdle(), this branch force-releases heldLock but never clears heldLockDraining. drainAcquisition only resolves after the original owner finishes, which is exactly the never-settling case you are trying to escape. After this timeout, dispose() and reacquireAfterPrompt() can still hang forever in waitForHeldLockDrain() even though the file lock was released. The timeout path needs to revoke/reset the drain immediately instead of waiting for the original owner to eventually hand it back.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agents/embedded-agent-runner/run/attempt.session-lock.ts` around lines
872 - 880, When the timeout path detects drainOwner === "abort-bound" it must
revoke the drain immediately instead of waiting for drainAcquisition to resolve;
modify the branch in the raceWithBound handler (around beginHeldLockDrain,
drainAcquisition, drainOwner) to clear/reset the heldLockDraining state right
away (so waitForHeldLockDrain/waitForRetainedLockIdle won't hang) and ensure the
pending drainAcquisition is detached (e.g., keep the void
drainAcquisition.then(...) but do not rely on its resolution to clear
heldLockDraining), then call forceRelease() and return; update
finishHeldLockDrain usage so the orphaned promise becomes a no-op and does not
re-set heldLockDraining when it later resolves.
…tled holders (#11) tulgey#238 root cause (confirmed by core-dump heap analysis of a live stall): withModelsJsonWriteLock chains each caller onto the prior holder's gate with no timeout. A holder that never settles (turn aborted while run() awaited a hung provider call) leaves the gate unresolved forever, so every subsequent model-resolution in the process parks on a dead promise — silent per-process deadlock: no error, no timeout, no log; runs stall at embedded_run:started. The wait on the prior holder is now bounded (OPENCLAW_MODELS_JSON_LOCK_WAIT_MS, default 60s); on expiry the caller logs and steals the lock. A rare concurrent models.json write (atomic temp-file rename) is recoverable; a deadlocked agent is not. Same family as the session-lock fixes (#6, upstream openclaw#87278). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ettled flushes (#15) A flush whose downstream work never settles (hung pre-agent media call in processMessage) parked every later inbound for that chat key on a dead promise chain: messages logged as Inbound, never reached the command queue, and the chat went permanently silent until restart re-primed the same poison. Fourth member of the unbounded await-the-prior-holder family (session lock #6, models.json #11, lane slot #12). Bounded at OPENCLAW_INBOUND_CHAIN_WAIT_MS (default 5 min) with a warn naming the chat key. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Prod incident tulgey#225 (membrane, 2026-06-01→03): every embedded-run timeout turned into 5–30 minutes of
SessionWriteLockTimeoutError→ "Something went wrong" on every subsequent turn, until themaxHoldMswatchdog reclaimed the session.jsonllock./newonly helped until the fresh session hit its own timeout.The deployment was on 2026.5.28, which already contains upstream openclaw#87278 (release lock on timeout abort) — and still wedged. The gap:
releaseHeldLockForAbortdelegates to the graceful fence release, whosewaitForRetainedLockIdleis unbounded. A retained-lock transcript write pinned behind a hung provider stream (one that never settles even afterrunAbortController.abort()— e.g. a stalledstreamGenerateContent) keeps the retained-use count nonzero forever, so the abort release never reacheslock.release(). openclaw#87278 / openclaw#88623 / openclaw#89811 all release on aborts that settle; nothing covered aborts a hung provider call never lets settle.Fix
releaseHeldLockForAbortnow runs a bounded variant:OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS, default 2s / 250ms underOPENCLAW_TEST_FAST), overridable per controller viaabortReleaseTimeoutMs.takeoverDetected, so the aborted run cannot perform torn writes after losing ownership (its laterwithSessionWriteLockthrowsEmbeddedAttemptSessionTakeoverError; cleanup degrades to the noop lock — same posture as a real takeover).heldLockDrainingleak;dispose()stays unblocked).Tests
force-releases the held lock when a retained write never settles (tulgey#225)— the repro: never-settling retained write, bounded abort release, lock released once, controller poisoned.abort release stays graceful when retained writes settle within the bound (tulgey#225)— no poisoning on the happy path.attempt.session-lock.test.tssuite: 90/90. Adjacent suites (attempt-abort,attempt.abort-settle-timeout,command-queue): 39/39. tsgo typecheck clean. CodeRabbit: 0 findings.Notes
fetch-guard.ts(buildTimeoutAbortSignalcomposes caller signal + timeout) and the fetch-level timeout did eventually fire in the incident. This PR bounds the damage window regardless of provider behavior — the chokepoint fix.🤖 Generated with Claude Code
Summary by CodeRabbit
Tests
New Features