fix(agents): release session write lock on timeout abort (#86816)#87278
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 27, 2026, 7:15 PM ET / 23:15 UTC. Summary PR surface: Source +168, Tests +354. Total +522 across 3 files. Reproducibility: yes. for source and component reproduction, not for a live end-to-end provider run. Current main lacks a timeout release in Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the timeout-specific lock release with the drain/fence contract intact once maintainers accept the session-state tradeoff and current-base required checks pass. Do we have a high-confidence way to reproduce the issue? Yes for source and component reproduction, not for a live end-to-end provider run. Current main lacks a timeout release in Is this the best way to solve the issue? Yes, this is the narrow implicated path: reuse the existing abort-release controller method from the timeout branch and strengthen controller race tests. The main unresolved question is maintainer acceptance of the early-release session-state tradeoff. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against c20a055341ac. Label changesLabel justifications:
Evidence reviewedPR surface: Source +168, Tests +354. Total +522 across 3 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Pearl Shellbean Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
fef2e49 to
b99372f
Compare
b99372f to
fa79977
Compare
fa79977 to
ecdcc4c
Compare
Co-authored-by: Chunyue Wang <16864032@qq.com>
ecdcc4c to
e66dc69
Compare
|
Maintainer update: I rebased this onto latest main and rewrote the branch to keep the fix on the current embedded-agent runner path. Behavior addressed: timeout abort now releases the retained session write lock, while draining any active retained-lock writer first and making cleanup/dispose/prompt release/reacquire wait for that drain.
|
…unsettled retained writes (#6) A turn timeout aborts the run, but releaseHeldLockForAbort delegated to the graceful fence release, whose waitForRetainedLockIdle is unbounded. A retained transcript write pinned behind a hung provider stream (one that ignores abort, e.g. a stalled streamGenerateContent) therefore held the session .jsonl lock until the maxHoldMs watchdog (5-30 min), and every turn in between failed with SessionWriteLockTimeoutError. releaseHeldLockForAbort now races the graceful path against the abort settle bound (OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS). When the bound expires the underlying file lock is force-released and the controller poisoned via takeoverDetected, so the aborted run cannot perform torn writes after losing ownership. Retained writes that settle within the bound keep the existing graceful fence semantics. Prod incident: tulgey#225 (membrane 2026-06-01..03). Completes the lock-leak family of openclaw#87278 / openclaw#88623 / openclaw#89811, which release on settled aborts but not on aborts a hung provider call never lets settle. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tled holders (#11) tulgey#238 root cause (confirmed by core-dump heap analysis of a live stall): withModelsJsonWriteLock chains each caller onto the prior holder's gate with no timeout. A holder that never settles (turn aborted while run() awaited a hung provider call) leaves the gate unresolved forever, so every subsequent model-resolution in the process parks on a dead promise — silent per-process deadlock: no error, no timeout, no log; runs stall at embedded_run:started. The wait on the prior holder is now bounded (OPENCLAW_MODELS_JSON_LOCK_WAIT_MS, default 60s); on expiry the caller logs and steals the lock. A rare concurrent models.json write (atomic temp-file rename) is recoverable; a deadlocked agent is not. Same family as the session-lock fixes (#6, upstream openclaw#87278). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
timeoutMs(600000ms in the reporter's environment) leaves the session write lock pinned in memory. All subsequent writes to that session fail withSessionWriteLockTimeoutErroruntil the gateway is restarted. Subagent announce retries (limit 3) all fail; the user sees no response on Feishu.runEmbeddedAttemptalready releases the session write lock from its outerfinally(releaseRetainedSessionLock?.()atsrc/agents/pi-embedded-runner/run/attempt.ts, shipped in SessionWriteLockTimeoutError: gateway never releases session file lock after embedded run timeout #86014/fix(agents): release embedded-attempt session lock on every exit path #86427), and the in-process session-write-lock module already runs a watchdog that force-releases held locks aftermaxHoldMs. Both safety nets exist. The remaining gap: when the embedded-run timeout fires (scheduleAbortTimer→abortRun(true)insrc/agents/pi-embedded-runner/run/attempt.ts), the inner stream / model call can fail to observe the abort signal and never unwind, sorunEmbeddedAttemptnever returns and the outerfinallynever runs. The watchdog still cleans the lock, but its window for the announce-style embedded attempt iscompactionTimeoutMs + 120s grace ≈ 17 min— too long for a user-facing chat bot. Operators restart the gateway well before the watchdog reclaims the lock, so from the user's view the lock is leaked "permanently until restart" (matching the issue's exact wording).abortRuninsrc/agents/pi-embedded-runner/run/attempt.tsthat, on theisTimeout === truepath, callssessionLockController.releaseHeldLockForAbort()immediately after the abort signal goes out. The helper already exists atsrc/agents/pi-embedded-runner/run/attempt.session-lock.ts(used by thesessions_yieldabort cleanup path) and is idempotent: it early-returns whenheldLockis already cleared. The outerfinally'sdispose()remains the canonical release path; this new branch shrinks the leak window from ~17 min (watchdog) to immediate.src/agents/pi-embedded-runner/run/attempt.ts—abortRun(isTimeout=true, ...)now callsvoid sessionLockController.releaseHeldLockForAbort().catch(log.warn)after the abort signal. Branch is gated onisTimeout === trueso manual cancel and non-timeout abort paths continue to rely on the outerfinally.src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts— new testreleaseHeldLockForAbort + dispose() are idempotent in succession (#86816)pins the idempotency contract the fix relies on.src/agents/pi-embedded-runner/run/attempt.session-lock.ts— controller API surface (releaseHeldLockForAbort,dispose) untouched; the fix reuses the existing helper.src/agents/session-write-lock.ts— file-lock acquire/release/watchdog logic untouched.finallyonly. No change toabortRun(false, ...)semantics.finallyand this new branch.docs/reference/configedits).extensions/api.ts/runtime-api.ts, registry, or loader edits).sessions.jsonformat, transcript jsonl format, or store-loader edits — relevant given the June SQLite migration).Reproduction
thinkingDefault: "adaptive"plus multiple subagent spawns (PPT generation, QA workflows).runEmbeddedAttemptnever returns; outerfinallynever runs; session write lock stays pinned in memory; all subsequent session writes fail withSessionWriteLockTimeoutErrorfor ~17 min (until the watchdog) — operators restart the gateway in the meantime, so the lock looks permanent.abortRun(true, ...)callsreleaseHeldLockForAbort()immediately, the session write lock is released, and the next subagent announce / inbound message succeeds without waiting for the watchdog or a restart.Real behavior proof
Behavior addressed (#86816): session write lock is released as soon as a timeout aborts a
runEmbeddedAttempt, even when the inner stream does not observe the abort andrunEmbeddedAttemptnever returns; non-timeout abort paths and the outerfinally'sdispose()remain canonical.Real environment tested (Linux, Node 22, real-component tsx harness driving production session-write-lock + createEmbeddedAttemptSessionLockController): tsx-resolved imports of production
acquireSessionWriteLock,createEmbeddedAttemptSessionLockController, andisSessionWriteLockTimeoutError. The harness mirrors theabortRuncall-sequence by creating a real controller, then for each scenario either calling or omittingreleaseHeldLockForAbort(), and probing the lock state via same-process re-acquire with a 1.5s timeout.Exact steps or command run after this patch:
node_modules/.bin/tsx /tmp/qmd86816/repro.mts(real-component BEFORE/AFTER lock-state sweep)node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.tspnpm exec oxfmt --check --threads=1 src/agents/pi-embedded-runner/run/attempt.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.tsEvidence after fix (verbatim real-component harness output):
Vitest output for the touched test file:
Observed result after fix:
isTimeout === false) continues to rely on the outerfinallyand the watchdog — no behavioral change for that path (Scenario C).releaseHeldLockForAbort + dispose() are idempotent in succession (#86816)test verifies that the outerfinally'sdispose()remains a safe no-op when the lock has already been released by the timeout branch.What was not tested:
runLockWatchdogCheck) — the fix bypasses the watchdog for the timeout path; the watchdog continues to be the safety net for any future code path that bypasses both the outerfinallyand this new branch.acquireSessionWriteLockcallers (src/agents/command/attempt-execution.ts,src/agents/pi-embedded-runner/tool-result-truncation.ts,src/agents/sandbox/registry.ts) — these are short-hold acquires with their owntry/finally; not part of the announce-embedded-run hang scenario.Regression tests:
attempt.session-lock.test.ts→ newreleaseHeldLockForAbort + dispose() are idempotent in succession (#86816)test plus the preserveddefensively releases the coarse attempt lock on sessions_yield abort cleanupandkeeps the session fence active after releasing for sessions_yield abort cleanuptests pin the controller-level contract this PR relies on.Risk / Mitigation
releaseHeldLockForAbort()is called eagerly before the inner stream unwinds; any innerwithSessionWriteLockthat has captured the lock by reference could in theory race with the new release. Mitigation: a late-arriving inner write is protected by the fence/takeover mechanism —releaseHeldLockWithFencesetsfenceActive = truebeforelock.release(), so any subsequentwithSessionWriteLockreacquires fresh viaacquireLock()and thenassertSessionFileFence()throwsEmbeddedAttemptSessionTakeoverErrorif the file fingerprint changed under it. Separately, same-process double-release is impossible becausecontroller.acquireWriteLock()returns{ lock, owned: false }when the call shares the controller'sheldLock, and the innerwithSessionWriteLock'sfinallyonly releases whenowned === true.isTimeout === truebranch. A non-timeout abort that hangs the inner stream would still leak the lock until the watchdog reclaims it. Mitigation: non-timeout aborts are user-initiated cancellations and follow a different lifecycle (the outerfinallyis reached because the caller awaits and unwinds normally). The watchdog continues to be the fallback. Adding the same release call to the non-timeout branch was considered but deferred — it would broaden the scope without a reported regression.releaseHeldLockForAbort()fromabortRuncould surface a previously-hidden double-release error path. Mitigation: pinned by the new idempotency test (releaseHeldLockForAbort + dispose()called four times in succession with one acquire and one release expected).src/agents/pi-embedded-runner/run/attempt.ts, an active area. Mitigation: change is 13 lines, additive, gated onisTimeout === true, uses an existing controller method already exercised by thesessions_yieldabort cleanup path. Recent main commits onpi-embedded-runner(cc704caa08,1710dac5eb,8c644ee611) confirm the module is under active maintenance.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #86816