Skip to content

fix(agents): release session write lock on timeout abort (#86816)#87278

Merged
steipete merged 2 commits into
openclaw:mainfrom
openperf:fix/86816-session-lock-leak-subagent-announce-timeout
May 27, 2026
Merged

fix(agents): release session write lock on timeout abort (#86816)#87278
steipete merged 2 commits into
openclaw:mainfrom
openperf:fix/86816-session-lock-leak-subagent-announce-timeout

Conversation

@openperf

@openperf openperf commented May 27, 2026

Copy link
Copy Markdown
Member

Summary

  • Problem: Issue [Bug]: Session write lock leaked after embedded run timeout during subagent announce #86816 reports that when a subagent completes and triggers an announce embedded run on the parent session, hitting the embedded-run timeoutMs (600000ms in the reporter's environment) leaves the session write lock pinned in memory. All subsequent writes to that session fail with SessionWriteLockTimeoutError until the gateway is restarted. Subagent announce retries (limit 3) all fail; the user sees no response on Feishu.
  • Root Cause: runEmbeddedAttempt already releases the session write lock from its outer finally (releaseRetainedSessionLock?.() at src/agents/pi-embedded-runner/run/attempt.ts, shipped in SessionWriteLockTimeoutError: gateway never releases session file lock after embedded run timeout #86014/fix(agents): release embedded-attempt session lock on every exit path #86427), and the in-process session-write-lock module already runs a watchdog that force-releases held locks after maxHoldMs. Both safety nets exist. The remaining gap: when the embedded-run timeout fires (scheduleAbortTimerabortRun(true) in src/agents/pi-embedded-runner/run/attempt.ts), the inner stream / model call can fail to observe the abort signal and never unwind, so runEmbeddedAttempt never returns and the outer finally never runs. The watchdog still cleans the lock, but its window for the announce-style embedded attempt is compactionTimeoutMs + 120s grace ≈ 17 min — too long for a user-facing chat bot. Operators restart the gateway well before the watchdog reclaims the lock, so from the user's view the lock is leaked "permanently until restart" (matching the issue's exact wording).
  • Fix: Add one branch to abortRun in src/agents/pi-embedded-runner/run/attempt.ts that, on the isTimeout === true path, calls sessionLockController.releaseHeldLockForAbort() immediately after the abort signal goes out. The helper already exists at src/agents/pi-embedded-runner/run/attempt.session-lock.ts (used by the sessions_yield abort cleanup path) and is idempotent: it early-returns when heldLock is already cleared. The outer finally's dispose() remains the canonical release path; this new branch shrinks the leak window from ~17 min (watchdog) to immediate.
  • What changed:
    • src/agents/pi-embedded-runner/run/attempt.tsabortRun(isTimeout=true, ...) now calls void sessionLockController.releaseHeldLockForAbort().catch(log.warn) after the abort signal. Branch is gated on isTimeout === true so manual cancel and non-timeout abort paths continue to rely on the outer finally.
    • src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts — new test releaseHeldLockForAbort + dispose() are idempotent in succession (#86816) pins the idempotency contract the fix relies on.
  • What did NOT change (scope boundary):
    • src/agents/pi-embedded-runner/run/attempt.session-lock.ts — controller API surface (releaseHeldLockForAbort, dispose) untouched; the fix reuses the existing helper.
    • src/agents/session-write-lock.ts — file-lock acquire/release/watchdog logic untouched.
    • Manual-cancel (non-timeout) abort path — still routes through the outer finally only. No change to abortRun(false, ...) semantics.
    • The watchdog continues to be the fallback for any path that bypasses both the outer finally and this new branch.
    • Config surface unchanged (no schema, defaults, doctor migrations, or docs/reference/config edits).
    • Plugin surface unchanged (no plugin SDK, manifest, extensions/api.ts / runtime-api.ts, registry, or loader edits).
    • Session store surface unchanged (no sessions.json format, transcript jsonl format, or store-loader edits — relevant given the June SQLite migration).

Reproduction

  1. On 2026.5.22, configure an agent with thinkingDefault: "adaptive" plus multiple subagent spawns (PPT generation, QA workflows).
  2. Have a long-running parent session with queued messages.
  3. A subagent completes; the announce flow starts an embedded run on the parent session.
  4. The model call hangs or runs past 600000ms.
  5. Before this PR: the abort signal fires but the inner stream does not observe it; runEmbeddedAttempt never returns; outer finally never runs; session write lock stays pinned in memory; all subsequent session writes fail with SessionWriteLockTimeoutError for ~17 min (until the watchdog) — operators restart the gateway in the meantime, so the lock looks permanent.
  6. After this PR: when the timeout aborts the run, abortRun(true, ...) calls releaseHeldLockForAbort() immediately, the session write lock is released, and the next subagent announce / inbound message succeeds without waiting for the watchdog or a restart.

Real behavior proof

Behavior addressed (#86816): session write lock is released as soon as a timeout aborts a runEmbeddedAttempt, even when the inner stream does not observe the abort and runEmbeddedAttempt never returns; non-timeout abort paths and the outer finally's dispose() remain canonical.

Real environment tested (Linux, Node 22, real-component tsx harness driving production session-write-lock + createEmbeddedAttemptSessionLockController): tsx-resolved imports of production acquireSessionWriteLock, createEmbeddedAttemptSessionLockController, and isSessionWriteLockTimeoutError. The harness mirrors the abortRun call-sequence by creating a real controller, then for each scenario either calling or omitting releaseHeldLockForAbort(), and probing the lock state via same-process re-acquire with a 1.5s timeout.

Exact steps or command run after this patch:

  1. node_modules/.bin/tsx /tmp/qmd86816/repro.mts (real-component BEFORE/AFTER lock-state sweep)
  2. node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
  3. pnpm exec oxfmt --check --threads=1 src/agents/pi-embedded-runner/run/attempt.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts

Evidence after fix (verbatim real-component harness output):

================ #86816 real-component lock-leak BEFORE/AFTER sweep ================

Drives production createEmbeddedAttemptSessionLockController + acquireSessionWriteLock.

--- SCENARIO A: BEFORE FIX (abortRun(true) does NOT release lock) ---
  controllerA created (heldLock acquired)
  simulating abortRun(true) without lock-release call...
  same-process re-acquire (1.5s timeout): held=true (1505ms)
  >>> BEFORE FIX RESULT: lock LEAKED (matches issue #86816 — gateway dead until watchdog ~17min or restart)

--- SCENARIO B: AFTER FIX (abortRun(true) calls releaseHeldLockForAbort) ---
  controllerB created (heldLock acquired)
  simulating abortRun(true) WITH lock-release call...
  same-process re-acquire (1.5s timeout): held=false (3ms)
  >>> AFTER FIX RESULT: lock released (matches expected — gateway recovers immediately)
  controllerB.dispose() called (idempotent — should be safe even after release-on-abort)

--- SCENARIO C: non-timeout abort (isTimeout=false) NOT affected ---
  controllerC created (heldLock acquired)
  simulating abortRun(false) — manual cancel, NOT timeout...
  same-process re-acquire (1.5s timeout): held=true (1505ms)
  >>> SAFETY: manual-cancel path lock still held (correct — outer-finally is responsible)

================ VERDICT ================
Scenario A (BEFORE FIX, timeout abort):  lock leaked = true  OK
Scenario B (AFTER FIX, timeout abort):   lock leaked = false  OK
Scenario C (manual cancel, not timeout): lock leaked = true  OK (outer-finally owns it)

PASS — fix changes timeout-abort behavior; non-timeout abort behavior preserved.

Vitest output for the touched test file:

 RUN  v4.1.7 /root/main/openclaw/.worktrees/pr-87107

 Test Files  2 passed (2)
      Tests  78 passed (78)
   Duration  7.91s

Observed result after fix:

  • Timeout-aborted embedded runs release the session write lock immediately (3ms in the harness vs the 1505ms re-acquire timeout BEFORE fix).
  • Manual-cancel abort (isTimeout === false) continues to rely on the outer finally and the watchdog — no behavioral change for that path (Scenario C).
  • The new releaseHeldLockForAbort + dispose() are idempotent in succession (#86816) test verifies that the outer finally's dispose() remains a safe no-op when the lock has already been released by the timeout branch.

What was not tested:

  • End-to-end live Feishu / PPT-workflow subagent announce. The real-component harness drives the production lock controller and lock helpers end-to-end but does not start a real announce / model HTTP call. The bug's necessary conditions (controller + abort + missing release) are reproduced deterministically.
  • Watchdog interaction (runLockWatchdogCheck) — the fix bypasses the watchdog for the timeout path; the watchdog continues to be the safety net for any future code path that bypasses both the outer finally and this new branch.
  • Other acquireSessionWriteLock callers (src/agents/command/attempt-execution.ts, src/agents/pi-embedded-runner/tool-result-truncation.ts, src/agents/sandbox/registry.ts) — these are short-hold acquires with their own try/finally; not part of the announce-embedded-run hang scenario.

Regression tests:

  • attempt.session-lock.test.ts → new releaseHeldLockForAbort + dispose() are idempotent in succession (#86816) test plus the preserved defensively releases the coarse attempt lock on sessions_yield abort cleanup and keeps the session fence active after releasing for sessions_yield abort cleanup tests pin the controller-level contract this PR relies on.

Risk / Mitigation

  • Risk: releaseHeldLockForAbort() is called eagerly before the inner stream unwinds; any inner withSessionWriteLock that has captured the lock by reference could in theory race with the new release. Mitigation: a late-arriving inner write is protected by the fence/takeover mechanism — releaseHeldLockWithFence sets fenceActive = true before lock.release(), so any subsequent withSessionWriteLock reacquires fresh via acquireLock() and then assertSessionFileFence() throws EmbeddedAttemptSessionTakeoverError if the file fingerprint changed under it. Separately, same-process double-release is impossible because controller.acquireWriteLock() returns { lock, owned: false } when the call shares the controller's heldLock, and the inner withSessionWriteLock's finally only releases when owned === true.
  • Risk: The fix only addresses the isTimeout === true branch. A non-timeout abort that hangs the inner stream would still leak the lock until the watchdog reclaims it. Mitigation: non-timeout aborts are user-initiated cancellations and follow a different lifecycle (the outer finally is reached because the caller awaits and unwinds normally). The watchdog continues to be the fallback. Adding the same release call to the non-timeout branch was considered but deferred — it would broaden the scope without a reported regression.
  • Risk: Calling releaseHeldLockForAbort() from abortRun could surface a previously-hidden double-release error path. Mitigation: pinned by the new idempotency test (releaseHeldLockForAbort + dispose() called four times in succession with one acquire and one release expected).
  • Risk: The fix lives in src/agents/pi-embedded-runner/run/attempt.ts, an active area. Mitigation: change is 13 lines, additive, gated on isTimeout === true, uses an existing controller method already exercised by the sessions_yield abort cleanup path. Recent main commits on pi-embedded-runner (cc704caa08, 1710dac5eb, 8c644ee611) confirm the module is under active maintenance.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration
  • Sessions / storage

Linked Issue/PR

Fixes #86816

@clawsweeper

clawsweeper Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 27, 2026, 7:15 PM ET / 23:15 UTC.

Summary
The branch releases the retained embedded-session write lock on timeout abort and expands the session-lock controller/tests to serialize abort, prompt, cleanup, dispose, and retained-write paths.

PR surface: Source +168, Tests +354. Total +522 across 3 files.

Reproducibility: yes. for source and component reproduction, not for a live end-to-end provider run. Current main lacks a timeout release in abortRun(true), and the PR body provides a real-component harness showing the before/after lock state.

Review metrics: none identified.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • The patch intentionally changes session-lock lifecycle timing by releasing the retained lock on timeout before the embedded run has fully unwound; the new drain/fence tests reduce the race risk but this remains a session-state merge decision.
  • The provided proof is strong component-level terminal proof, but it does not include a live timed-out Feishu/provider announce run; maintainers should decide whether that gap is acceptable for this P1 fix.

Maintainer options:

  1. Merge with the drain/fence contract (recommended)
    Accept the intentional early timeout release because the patch waits for active retained-lock writers and fences later writes before reacquire/cleanup paths proceed.
  2. Require live timeout proof
    Ask for a live timed-out announce/provider scenario before merge if component-level lock proof is not enough for this session-state path.

Next step before merge
No automated repair is needed; maintainer action is to accept or reject the session-lock lifecycle tradeoff and merge after required checks.

Security
Cleared: The diff only touches TypeScript runtime lock handling and focused tests; it does not add dependency, CI, release, permission, credential, or supply-chain surfaces.

Review details

Best possible solution:

Land the timeout-specific lock release with the drain/fence contract intact once maintainers accept the session-state tradeoff and current-base required checks pass.

Do we have a high-confidence way to reproduce the issue?

Yes for source and component reproduction, not for a live end-to-end provider run. Current main lacks a timeout release in abortRun(true), and the PR body provides a real-component harness showing the before/after lock state.

Is this the best way to solve the issue?

Yes, this is the narrow implicated path: reuse the existing abort-release controller method from the timeout branch and strengthen controller race tests. The main unresolved question is maintainer acceptance of the early-release session-state tradeoff.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against c20a055341ac.

Label changes

Label justifications:

  • P1: The PR targets a reported session write-lock leak that can block a real agent/channel workflow until gateway restart.
  • merge-risk: 🚨 session-state: Merging changes when retained session locks are released during timeout aborts, which can affect transcript/session state if the drain or fence contract is wrong.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes terminal output from a real-component harness using production lock helpers plus focused test/format evidence; the maintainer comment adds a Node 24 focused shard and autoreview result.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes terminal output from a real-component harness using production lock helpers plus focused test/format evidence; the maintainer comment adds a Node 24 focused shard and autoreview result.
Evidence reviewed

PR surface:

Source +168, Tests +354. Total +522 across 3 files.

View PR surface stats
Area Files Added Removed Net
Source 2 213 45 +168
Tests 1 354 0 +354
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 3 567 45 +522

What I checked:

Likely related people:

  • steipete: Current-main blame for the embedded session-lock controller points to Peter Steinberger, and the recent current-main refactor moved session write-lock ownership into the owned session runtime. (role: recent area contributor; confidence: high; commits: 4da2b5f4d904, 5f68291f4f54; files: src/agents/embedded-agent-runner/run/attempt.session-lock.ts, src/agents/embedded-agent-runner/run/attempt.ts, src/agents/sessions/agent-session.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 27, 2026
@clawsweeper

clawsweeper Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Pearl Shellbean

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: purrs at green checks.
Image traits: location branch lighthouse; accessory lint brush; palette rose quartz and slate; mood focused; pose standing beside its cracked shell; shell smooth pearl shell; lighting golden review-room light; background miniature CI buoys.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Pearl Shellbean in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@openperf openperf force-pushed the fix/86816-session-lock-leak-subagent-announce-timeout branch from fef2e49 to b99372f Compare May 27, 2026 12:33
@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label May 27, 2026
@steipete steipete self-assigned this May 27, 2026
@clawsweeper clawsweeper Bot added rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels May 27, 2026
@openperf openperf force-pushed the fix/86816-session-lock-leak-subagent-announce-timeout branch from b99372f to fa79977 Compare May 27, 2026 22:56
@steipete steipete force-pushed the fix/86816-session-lock-leak-subagent-announce-timeout branch from fa79977 to ecdcc4c Compare May 27, 2026 22:58
Co-authored-by: Chunyue Wang <16864032@qq.com>
@steipete steipete force-pushed the fix/86816-session-lock-leak-subagent-announce-timeout branch from ecdcc4c to e66dc69 Compare May 27, 2026 23:00
@steipete

Copy link
Copy Markdown
Contributor

Maintainer update: I rebased this onto latest main and rewrote the branch to keep the fix on the current embedded-agent runner path.

Behavior addressed: timeout abort now releases the retained session write lock, while draining any active retained-lock writer first and making cleanup/dispose/prompt release/reacquire wait for that drain.
Real environment tested: local macOS checkout, Node 24.15.0.
Exact steps or command run after this patch:

  • git diff --check origin/main...HEAD
  • fnm exec --using 24.15.0 -- pnpm test src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts -- --reporter=dot
  • .agents/skills/autoreview/scripts/autoreview --mode local --prompt "Final final review after reverting accidental formatter churn; only the intended 3-file timeout session-lock patch remains. Focus on actionable correctness regressions only."
    Evidence after fix: focused shard passed on the rebased commit with 40 tests; autoreview reported no accepted/actionable findings.
    Observed result after fix: timeout abort, cleanup, dispose, prompt release, prompt reacquire, and write-lock acquisition all serialize against the held-lock drain instead of racing the same lock release.
    What was not tested: live timed-out subagent run in a real provider session.

@steipete steipete merged commit 65fb565 into openclaw:main May 27, 2026
99 checks passed
matin added a commit to matin/openclaw that referenced this pull request Jun 3, 2026
…unsettled retained writes (#6)

A turn timeout aborts the run, but releaseHeldLockForAbort delegated to the
graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained
transcript write pinned behind a hung provider stream (one that ignores abort,
e.g. a stalled streamGenerateContent) therefore held the session .jsonl lock
until the maxHoldMs watchdog (5-30 min), and every turn in between failed with
SessionWriteLockTimeoutError.

releaseHeldLockForAbort now races the graceful path against the abort settle
bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS). When the bound expires the
underlying file lock is force-released and the controller poisoned via
takeoverDetected, so the aborted run cannot perform torn writes after losing
ownership. Retained writes that settle within the bound keep the existing
graceful fence semantics.

Prod incident: tulgey#225 (membrane 2026-06-01..03). Completes the lock-leak
family of openclaw#87278 / openclaw#88623 / openclaw#89811, which release on settled aborts but not on
aborts a hung provider call never lets settle.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
matin added a commit to matin/openclaw that referenced this pull request Jun 5, 2026
…tled holders (#11)

tulgey#238 root cause (confirmed by core-dump heap analysis of a live stall):
withModelsJsonWriteLock chains each caller onto the prior holder's gate with
no timeout. A holder that never settles (turn aborted while run() awaited a
hung provider call) leaves the gate unresolved forever, so every subsequent
model-resolution in the process parks on a dead promise — silent per-process
deadlock: no error, no timeout, no log; runs stall at embedded_run:started.

The wait on the prior holder is now bounded (OPENCLAW_MODELS_JSON_LOCK_WAIT_MS,
default 60s); on expiry the caller logs and steals the lock. A rare concurrent
models.json write (atomic temp-file rename) is recoverable; a deadlocked agent
is not. Same family as the session-lock fixes (#6, upstream openclaw#87278).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. size: L status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Session write lock leaked after embedded run timeout during subagent announce

2 participants