fix(agents): bound abort-path session lock release; force-release on unsettled retained writes by matin · Pull Request #6 · matin/openclaw

matin · 2026-06-03T23:16:08Z

Problem

Prod incident tulgey#225 (membrane, 2026-06-01→03): every embedded-run timeout turned into 5–30 minutes of SessionWriteLockTimeoutError → "Something went wrong" on every subsequent turn, until the maxHoldMs watchdog reclaimed the session .jsonl lock. /new only helped until the fresh session hit its own timeout.

The deployment was on 2026.5.28, which already contains upstream openclaw#87278 (release lock on timeout abort) — and still wedged. The gap: releaseHeldLockForAbort delegates to the graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained-lock transcript write pinned behind a hung provider stream (one that never settles even after runAbortController.abort() — e.g. a stalled streamGenerateContent) keeps the retained-use count nonzero forever, so the abort release never reaches lock.release(). openclaw#87278 / openclaw#88623 / openclaw#89811 all release on aborts that settle; nothing covered aborts a hung provider call never lets settle.

Fix

releaseHeldLockForAbort now runs a bounded variant:

Drain acquisition and the retained-idle wait are each raced against the abort settle bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS, default 2s / 250ms under OPENCLAW_TEST_FAST), overridable per controller via abortReleaseTimeoutMs.
Retained writes that settle within the bound keep the existing graceful fence semantics (fence stays active, no behavior change — covered by the existing [Bug]: Session write lock leaked after embedded run timeout during subagent announce openclaw/openclaw#86816 tests).
When the bound expires, the underlying file lock is force-released and the controller is poisoned via takeoverDetected, so the aborted run cannot perform torn writes after losing ownership (its later withSessionWriteLock throws EmbeddedAttemptSessionTakeoverError; cleanup degrades to the noop lock — same posture as a real takeover).
If another release path owns the drain and is itself stuck, the abort path takes release ownership directly and hands the eventually-acquired drain straight back (no heldLockDraining leak; dispose() stays unblocked).

Tests

force-releases the held lock when a retained write never settles (tulgey#225) — the repro: never-settling retained write, bounded abort release, lock released once, controller poisoned.
abort release stays graceful when retained writes settle within the bound (tulgey#225) — no poisoning on the happy path.
Full attempt.session-lock.test.ts suite: 90/90. Adjacent suites (attempt-abort, attempt.abort-settle-timeout, command-queue): 39/39. tsgo typecheck clean. CodeRabbit: 0 findings.

Notes

Abort propagation (cancelling the provider fetch/stream itself) is the complementary half; signal plumbing exists in fetch-guard.ts (buildTimeoutAbortSignal composes caller signal + timeout) and the fetch-level timeout did eventually fire in the incident. This PR bounds the damage window regardless of provider behavior — the chokepoint fix.
Candidate for upstreaming to openclaw/openclaw alongside the fix(agents): release session write lock on timeout abort (#86816) openclaw/openclaw#87278 family.

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Added comprehensive test cases for session lock abort-release behavior, covering forced release when operations don't settle and graceful release when they complete within timeout.
New Features
- Added configurable timeout mechanism for session lock release during abort operations to improve handling of stuck write operations.

…unsettled retained writes A turn timeout aborts the run, but releaseHeldLockForAbort delegated to the graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained transcript write pinned behind a hung provider stream (one that ignores abort, e.g. a stalled streamGenerateContent) therefore held the session .jsonl lock until the maxHoldMs watchdog (5-30 min), and every turn in between failed with SessionWriteLockTimeoutError. releaseHeldLockForAbort now races the graceful path against the abort settle bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS). When the bound expires the underlying file lock is force-released and the controller poisoned via takeoverDetected, so the aborted run cannot perform torn writes after losing ownership. Retained writes that settle within the bound keep the existing graceful fence semantics. Prod incident: tulgey#225 (membrane 2026-06-01..03). Completes the lock-leak family of openclaw#87278 / openclaw#88623 / openclaw#89811, which release on settled aborts but not on aborts a hung provider call never lets settle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-03T23:16:42Z

📝 Walkthrough

Walkthrough

This PR adds timeout-bounded semantics to embedded attempt session lock abort-release paths. The controller now accepts an optional abortReleaseTimeoutMs parameter, and during abort-triggered lock release, a new bounded routine enforces a timeout using Promise.race. If the timeout expires before held locks settle, the lock is force-released and takeover poisoning prevents further writes. Two test cases validate the force-release and graceful-completion paths.

Changes

Bounded timeout for abort-release lock semantics

Layer / File(s)	Summary
Abort-release timeout contract and bounded implementation `src/agents/embedded-agent-runner/run/attempt.session-lock.ts`	`createEmbeddedAttemptSessionLockController` adds optional `abortReleaseTimeoutMs` parameter. New `releaseHeldLockForAbortBounded` routine races held-lock settlement against a timeout; on expiry, lock is force-released and `takeoverDetected` is set to prevent further writes. `releaseHeldLockForAbort` is updated to call the bounded routine, with default timeout resolved via `resolveEmbeddedAbortSettleTimeoutMs`.
Abort-release timeout validation `src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts`	Two new tests: one drives a retained write to non-settling state, triggers abort with short timeout, and asserts force-release with takeover poisoning; the second ensures graceful settlement within a long timeout bound with no poisoning and normal subsequent operation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A lock that times out when abort's too slow,
Force-release the grip, let the timeout glow!
When writes are hung and won't retreat,
Takeover guards ensure safe retreat.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and specifically describes the main change: bounding abort-path session lock release with force-release on unsettled retained writes, matching the core fix for the production incident.
Description check	✅ Passed	The PR description is comprehensive and follows the template structure with clear Problem, Fix, Tests, and Notes sections that explain intent, outcome, and validation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch issue-225-abort-lock-release

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts (1)
166-231: ⚡ Quick win

Add coverage for the timeout branch where another release already owns the drain.

These two cases only exercise the timeout when releaseHeldLockForAbort() becomes the drain owner first. They never hit the drainOwner === "abort-bound" path, so a stuck releaseForPrompt()/dispose() can leave heldLockDraining wedged without this suite noticing. Please add a focused case that starts a graceful release first, then times out the abort path, and asserts dispose() or reacquireAfterPrompt() still resolve.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts` around
lines 166 - 231, Add a new unit test that exercises the timeout branch where the
graceful release becomes the drain owner before the abort path: create a
controller via createEmbeddedAttemptSessionLockController, start a graceful
release path (call whatever triggers the prompt release/releaseForPrompt or
dispose path by invoking withSessionWriteLock then calling
controller.releaseForPrompt()/dispose()), then concurrently call
controller.releaseHeldLockForAbort() with a very small abortReleaseTimeoutMs so
the abort path times out and obtains drainOwner === "abort-bound"; assert that
the abort path times out (i.e., releaseHeldLockForAbort resolves) and that
subsequent calls like controller.dispose() or controller.reacquireAfterPrompt()
still resolve (no wedged heldLockDraining), and verify release was called the
expected number of times. Ensure the test references releaseHeldLockForAbort,
releaseForPrompt/dispose (whichever exists), withSessionWriteLock, and
reacquireAfterPrompt to locate the implementation.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/agents/embedded-agent-runner/run/attempt.session-lock.ts`:
- Around line 872-880: When the timeout path detects drainOwner ===
"abort-bound" it must revoke the drain immediately instead of waiting for
drainAcquisition to resolve; modify the branch in the raceWithBound handler
(around beginHeldLockDrain, drainAcquisition, drainOwner) to clear/reset the
heldLockDraining state right away (so
waitForHeldLockDrain/waitForRetainedLockIdle won't hang) and ensure the pending
drainAcquisition is detached (e.g., keep the void drainAcquisition.then(...) but
do not rely on its resolution to clear heldLockDraining), then call
forceRelease() and return; update finishHeldLockDrain usage so the orphaned
promise becomes a no-op and does not re-set heldLockDraining when it later
resolves.

---

Nitpick comments:
In `@src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts`:
- Around line 166-231: Add a new unit test that exercises the timeout branch
where the graceful release becomes the drain owner before the abort path: create
a controller via createEmbeddedAttemptSessionLockController, start a graceful
release path (call whatever triggers the prompt release/releaseForPrompt or
dispose path by invoking withSessionWriteLock then calling
controller.releaseForPrompt()/dispose()), then concurrently call
controller.releaseHeldLockForAbort() with a very small abortReleaseTimeoutMs so
the abort path times out and obtains drainOwner === "abort-bound"; assert that
the abort path times out (i.e., releaseHeldLockForAbort resolves) and that
subsequent calls like controller.dispose() or controller.reacquireAfterPrompt()
still resolve (no wedged heldLockDraining), and verify release was called the
expected number of times. Ensure the test references releaseHeldLockForAbort,
releaseForPrompt/dispose (whichever exists), withSessionWriteLock, and
reacquireAfterPrompt to locate the implementation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c3b5f49f-9adb-486d-921b-dc5ff5cdbe00

📥 Commits

Reviewing files that changed from the base of the PR and between cad7da3 and 3ff02b7.

📒 Files selected for processing (2)

src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts
src/agents/embedded-agent-runner/run/attempt.session-lock.ts

coderabbitai · 2026-06-03T23:21:51Z

+    const drainAcquisition = beginHeldLockDrain();
+    const drainOwner = await raceWithBound(drainAcquisition);
+    if (drainOwner === "abort-bound") {
+      // Another release path owns the drain and is itself stuck. Take release
+      // ownership directly (clearing `heldLock` makes the stuck path no-op when
+      // it resumes) and hand the eventually-acquired drain straight back.
+      void drainAcquisition.then((owner) => finishHeldLockDrain(owner));
+      await forceRelease();
+      return;


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Timeout force-release can still wedge the controller behind heldLockDraining.

If releaseForPrompt() or dispose() already owns the drain and is hung in waitForRetainedLockIdle(), this branch force-releases heldLock but never clears heldLockDraining. drainAcquisition only resolves after the original owner finishes, which is exactly the never-settling case you are trying to escape. After this timeout, dispose() and reacquireAfterPrompt() can still hang forever in waitForHeldLockDrain() even though the file lock was released. The timeout path needs to revoke/reset the drain immediately instead of waiting for the original owner to eventually hand it back.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agents/embedded-agent-runner/run/attempt.session-lock.ts` around lines 872 - 880, When the timeout path detects drainOwner === "abort-bound" it must revoke the drain immediately instead of waiting for drainAcquisition to resolve; modify the branch in the raceWithBound handler (around beginHeldLockDrain, drainAcquisition, drainOwner) to clear/reset the heldLockDraining state right away (so waitForHeldLockDrain/waitForRetainedLockIdle won't hang) and ensure the pending drainAcquisition is detached (e.g., keep the void drainAcquisition.then(...) but do not rely on its resolution to clear heldLockDraining), then call forceRelease() and return; update finishHeldLockDrain usage so the orphaned promise becomes a no-op and does not re-set heldLockDraining when it later resolves.

…tled holders (#11) tulgey#238 root cause (confirmed by core-dump heap analysis of a live stall): withModelsJsonWriteLock chains each caller onto the prior holder's gate with no timeout. A holder that never settles (turn aborted while run() awaited a hung provider call) leaves the gate unresolved forever, so every subsequent model-resolution in the process parks on a dead promise — silent per-process deadlock: no error, no timeout, no log; runs stall at embedded_run:started. The wait on the prior holder is now bounded (OPENCLAW_MODELS_JSON_LOCK_WAIT_MS, default 60s); on expiry the caller logs and steals the lock. A rare concurrent models.json write (atomic temp-file rename) is recoverable; a deadlocked agent is not. Same family as the session-lock fixes (#6, upstream openclaw#87278). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ettled flushes (#15) A flush whose downstream work never settles (hung pre-agent media call in processMessage) parked every later inbound for that chat key on a dead promise chain: messages logged as Inbound, never reached the command queue, and the chat went permanently silent until restart re-primed the same poison. Fourth member of the unbounded await-the-prior-holder family (session lock #6, models.json #11, lane slot #12). Bounded at OPENCLAW_INBOUND_CHAIN_WAIT_MS (default 5 min) with a warn naming the chat key. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

matin merged commit aa5a446 into main Jun 3, 2026
133 of 150 checks passed

matin deleted the issue-225-abort-lock-release branch June 3, 2026 23:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agents): bound abort-path session lock release; force-release on unsettled retained writes#6

fix(agents): bound abort-path session lock release; force-release on unsettled retained writes#6
matin merged 1 commit into
mainfrom
issue-225-abort-lock-release

matin commented Jun 3, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

matin commented Jun 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

matin commented Jun 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading