Skip to content

fix(agents): bound abort-path session lock release; force-release on unsettled retained writes#6

Merged
matin merged 1 commit into
mainfrom
issue-225-abort-lock-release
Jun 3, 2026
Merged

fix(agents): bound abort-path session lock release; force-release on unsettled retained writes#6
matin merged 1 commit into
mainfrom
issue-225-abort-lock-release

Conversation

@matin

@matin matin commented Jun 3, 2026

Copy link
Copy Markdown
Owner

Problem

Prod incident tulgey#225 (membrane, 2026-06-01→03): every embedded-run timeout turned into 5–30 minutes of SessionWriteLockTimeoutError → "Something went wrong" on every subsequent turn, until the maxHoldMs watchdog reclaimed the session .jsonl lock. /new only helped until the fresh session hit its own timeout.

The deployment was on 2026.5.28, which already contains upstream openclaw#87278 (release lock on timeout abort) — and still wedged. The gap: releaseHeldLockForAbort delegates to the graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained-lock transcript write pinned behind a hung provider stream (one that never settles even after runAbortController.abort() — e.g. a stalled streamGenerateContent) keeps the retained-use count nonzero forever, so the abort release never reaches lock.release(). openclaw#87278 / openclaw#88623 / openclaw#89811 all release on aborts that settle; nothing covered aborts a hung provider call never lets settle.

Fix

releaseHeldLockForAbort now runs a bounded variant:

  • Drain acquisition and the retained-idle wait are each raced against the abort settle bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS, default 2s / 250ms under OPENCLAW_TEST_FAST), overridable per controller via abortReleaseTimeoutMs.
  • Retained writes that settle within the bound keep the existing graceful fence semantics (fence stays active, no behavior change — covered by the existing [Bug]: Session write lock leaked after embedded run timeout during subagent announce openclaw/openclaw#86816 tests).
  • When the bound expires, the underlying file lock is force-released and the controller is poisoned via takeoverDetected, so the aborted run cannot perform torn writes after losing ownership (its later withSessionWriteLock throws EmbeddedAttemptSessionTakeoverError; cleanup degrades to the noop lock — same posture as a real takeover).
  • If another release path owns the drain and is itself stuck, the abort path takes release ownership directly and hands the eventually-acquired drain straight back (no heldLockDraining leak; dispose() stays unblocked).

Tests

  • force-releases the held lock when a retained write never settles (tulgey#225) — the repro: never-settling retained write, bounded abort release, lock released once, controller poisoned.
  • abort release stays graceful when retained writes settle within the bound (tulgey#225) — no poisoning on the happy path.
  • Full attempt.session-lock.test.ts suite: 90/90. Adjacent suites (attempt-abort, attempt.abort-settle-timeout, command-queue): 39/39. tsgo typecheck clean. CodeRabbit: 0 findings.

Notes

  • Abort propagation (cancelling the provider fetch/stream itself) is the complementary half; signal plumbing exists in fetch-guard.ts (buildTimeoutAbortSignal composes caller signal + timeout) and the fetch-level timeout did eventually fire in the incident. This PR bounds the damage window regardless of provider behavior — the chokepoint fix.
  • Candidate for upstreaming to openclaw/openclaw alongside the fix(agents): release session write lock on timeout abort (#86816) openclaw/openclaw#87278 family.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests

    • Added comprehensive test cases for session lock abort-release behavior, covering forced release when operations don't settle and graceful release when they complete within timeout.
  • New Features

    • Added configurable timeout mechanism for session lock release during abort operations to improve handling of stuck write operations.

…unsettled retained writes

A turn timeout aborts the run, but releaseHeldLockForAbort delegated to the
graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained
transcript write pinned behind a hung provider stream (one that ignores abort,
e.g. a stalled streamGenerateContent) therefore held the session .jsonl lock
until the maxHoldMs watchdog (5-30 min), and every turn in between failed with
SessionWriteLockTimeoutError.

releaseHeldLockForAbort now races the graceful path against the abort settle
bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS). When the bound expires the
underlying file lock is force-released and the controller poisoned via
takeoverDetected, so the aborted run cannot perform torn writes after losing
ownership. Retained writes that settle within the bound keep the existing
graceful fence semantics.

Prod incident: tulgey#225 (membrane 2026-06-01..03). Completes the lock-leak
family of openclaw#87278 / openclaw#88623 / openclaw#89811, which release on settled aborts but not on
aborts a hung provider call never lets settle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds timeout-bounded semantics to embedded attempt session lock abort-release paths. The controller now accepts an optional abortReleaseTimeoutMs parameter, and during abort-triggered lock release, a new bounded routine enforces a timeout using Promise.race. If the timeout expires before held locks settle, the lock is force-released and takeover poisoning prevents further writes. Two test cases validate the force-release and graceful-completion paths.

Changes

Bounded timeout for abort-release lock semantics

Layer / File(s) Summary
Abort-release timeout contract and bounded implementation
src/agents/embedded-agent-runner/run/attempt.session-lock.ts
createEmbeddedAttemptSessionLockController adds optional abortReleaseTimeoutMs parameter. New releaseHeldLockForAbortBounded routine races held-lock settlement against a timeout; on expiry, lock is force-released and takeoverDetected is set to prevent further writes. releaseHeldLockForAbort is updated to call the bounded routine, with default timeout resolved via resolveEmbeddedAbortSettleTimeoutMs.
Abort-release timeout validation
src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts
Two new tests: one drives a retained write to non-settling state, triggers abort with short timeout, and asserts force-release with takeover poisoning; the second ensures graceful settlement within a long timeout bound with no poisoning and normal subsequent operation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A lock that times out when abort's too slow,
Force-release the grip, let the timeout glow!
When writes are hung and won't retreat,
Takeover guards ensure safe retreat.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and specifically describes the main change: bounding abort-path session lock release with force-release on unsettled retained writes, matching the core fix for the production incident.
Description check ✅ Passed The PR description is comprehensive and follows the template structure with clear Problem, Fix, Tests, and Notes sections that explain intent, outcome, and validation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue-225-abort-lock-release

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts (1)

166-231: ⚡ Quick win

Add coverage for the timeout branch where another release already owns the drain.

These two cases only exercise the timeout when releaseHeldLockForAbort() becomes the drain owner first. They never hit the drainOwner === "abort-bound" path, so a stuck releaseForPrompt()/dispose() can leave heldLockDraining wedged without this suite noticing. Please add a focused case that starts a graceful release first, then times out the abort path, and asserts dispose() or reacquireAfterPrompt() still resolve.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts` around
lines 166 - 231, Add a new unit test that exercises the timeout branch where the
graceful release becomes the drain owner before the abort path: create a
controller via createEmbeddedAttemptSessionLockController, start a graceful
release path (call whatever triggers the prompt release/releaseForPrompt or
dispose path by invoking withSessionWriteLock then calling
controller.releaseForPrompt()/dispose()), then concurrently call
controller.releaseHeldLockForAbort() with a very small abortReleaseTimeoutMs so
the abort path times out and obtains drainOwner === "abort-bound"; assert that
the abort path times out (i.e., releaseHeldLockForAbort resolves) and that
subsequent calls like controller.dispose() or controller.reacquireAfterPrompt()
still resolve (no wedged heldLockDraining), and verify release was called the
expected number of times. Ensure the test references releaseHeldLockForAbort,
releaseForPrompt/dispose (whichever exists), withSessionWriteLock, and
reacquireAfterPrompt to locate the implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/agents/embedded-agent-runner/run/attempt.session-lock.ts`:
- Around line 872-880: When the timeout path detects drainOwner ===
"abort-bound" it must revoke the drain immediately instead of waiting for
drainAcquisition to resolve; modify the branch in the raceWithBound handler
(around beginHeldLockDrain, drainAcquisition, drainOwner) to clear/reset the
heldLockDraining state right away (so
waitForHeldLockDrain/waitForRetainedLockIdle won't hang) and ensure the pending
drainAcquisition is detached (e.g., keep the void drainAcquisition.then(...) but
do not rely on its resolution to clear heldLockDraining), then call
forceRelease() and return; update finishHeldLockDrain usage so the orphaned
promise becomes a no-op and does not re-set heldLockDraining when it later
resolves.

---

Nitpick comments:
In `@src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts`:
- Around line 166-231: Add a new unit test that exercises the timeout branch
where the graceful release becomes the drain owner before the abort path: create
a controller via createEmbeddedAttemptSessionLockController, start a graceful
release path (call whatever triggers the prompt release/releaseForPrompt or
dispose path by invoking withSessionWriteLock then calling
controller.releaseForPrompt()/dispose()), then concurrently call
controller.releaseHeldLockForAbort() with a very small abortReleaseTimeoutMs so
the abort path times out and obtains drainOwner === "abort-bound"; assert that
the abort path times out (i.e., releaseHeldLockForAbort resolves) and that
subsequent calls like controller.dispose() or controller.reacquireAfterPrompt()
still resolve (no wedged heldLockDraining), and verify release was called the
expected number of times. Ensure the test references releaseHeldLockForAbort,
releaseForPrompt/dispose (whichever exists), withSessionWriteLock, and
reacquireAfterPrompt to locate the implementation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c3b5f49f-9adb-486d-921b-dc5ff5cdbe00

📥 Commits

Reviewing files that changed from the base of the PR and between cad7da3 and 3ff02b7.

📒 Files selected for processing (2)
  • src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts
  • src/agents/embedded-agent-runner/run/attempt.session-lock.ts

Comment on lines +872 to +880
const drainAcquisition = beginHeldLockDrain();
const drainOwner = await raceWithBound(drainAcquisition);
if (drainOwner === "abort-bound") {
// Another release path owns the drain and is itself stuck. Take release
// ownership directly (clearing `heldLock` makes the stuck path no-op when
// it resumes) and hand the eventually-acquired drain straight back.
void drainAcquisition.then((owner) => finishHeldLockDrain(owner));
await forceRelease();
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Timeout force-release can still wedge the controller behind heldLockDraining.

If releaseForPrompt() or dispose() already owns the drain and is hung in waitForRetainedLockIdle(), this branch force-releases heldLock but never clears heldLockDraining. drainAcquisition only resolves after the original owner finishes, which is exactly the never-settling case you are trying to escape. After this timeout, dispose() and reacquireAfterPrompt() can still hang forever in waitForHeldLockDrain() even though the file lock was released. The timeout path needs to revoke/reset the drain immediately instead of waiting for the original owner to eventually hand it back.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agents/embedded-agent-runner/run/attempt.session-lock.ts` around lines
872 - 880, When the timeout path detects drainOwner === "abort-bound" it must
revoke the drain immediately instead of waiting for drainAcquisition to resolve;
modify the branch in the raceWithBound handler (around beginHeldLockDrain,
drainAcquisition, drainOwner) to clear/reset the heldLockDraining state right
away (so waitForHeldLockDrain/waitForRetainedLockIdle won't hang) and ensure the
pending drainAcquisition is detached (e.g., keep the void
drainAcquisition.then(...) but do not rely on its resolution to clear
heldLockDraining), then call forceRelease() and return; update
finishHeldLockDrain usage so the orphaned promise becomes a no-op and does not
re-set heldLockDraining when it later resolves.

@matin matin merged commit aa5a446 into main Jun 3, 2026
133 of 150 checks passed
@matin matin deleted the issue-225-abort-lock-release branch June 3, 2026 23:25
matin added a commit that referenced this pull request Jun 5, 2026
…tled holders (#11)

tulgey#238 root cause (confirmed by core-dump heap analysis of a live stall):
withModelsJsonWriteLock chains each caller onto the prior holder's gate with
no timeout. A holder that never settles (turn aborted while run() awaited a
hung provider call) leaves the gate unresolved forever, so every subsequent
model-resolution in the process parks on a dead promise — silent per-process
deadlock: no error, no timeout, no log; runs stall at embedded_run:started.

The wait on the prior holder is now bounded (OPENCLAW_MODELS_JSON_LOCK_WAIT_MS,
default 60s); on expiry the caller logs and steals the lock. A rare concurrent
models.json write (atomic temp-file rename) is recoverable; a deadlocked agent
is not. Same family as the session-lock fixes (#6, upstream openclaw#87278).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
matin added a commit that referenced this pull request Jun 5, 2026
…ettled flushes (#15)

A flush whose downstream work never settles (hung pre-agent media call in
processMessage) parked every later inbound for that chat key on a dead promise
chain: messages logged as Inbound, never reached the command queue, and the
chat went permanently silent until restart re-primed the same poison. Fourth
member of the unbounded await-the-prior-holder family (session lock #6,
models.json #11, lane slot #12). Bounded at OPENCLAW_INBOUND_CHAIN_WAIT_MS
(default 5 min) with a warn naming the chat key.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant