fix(agents): abandon hung in-flight write lock on attempt cleanup (#84193)#84220
fix(agents): abandon hung in-flight write lock on attempt cleanup (#84193)#84220yetval wants to merge 2 commits into
Conversation
Pi auto-compaction wraps `session.compact()` under the session JSONL write lock via `installSessionExternalHookWriteLock`. When the compact() call hangs (post-run, after the outer abort/timeout has already fired), the reacquired lock is held by an `await run()` body that never settles, so the finally block that releases it never executes either. Until the session-write-lock watchdog forces a release at `maxHoldMs` (default runTimeout + compaction grace = many minutes), every later turn on the same session bounces off `SessionWriteLockTimeoutError` at the 60s acquire timeout. The only observed recovery was a Gateway restart. Track each reacquired write lock and abandon any still-held entries when `acquireForCleanup` runs. The transcript fence already guards against partial-write divergence on the next acquire, and the existing finally path stays correct (release is idempotent via the entry flag). Refs openclaw#84193
|
Codex review: needs maintainer review before merge. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. with high confidence from source inspection and the PR's real-fs proof: current main can hold a reacquired session write lock until the wrapped compaction/hook run settles, while cleanup later contends with that same lock. I did not run local tests because this review is constrained to read-only inspection. PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. PR egg Rarity: 🥚 common. What is this egg doing here?
Real behavior proof Mantis proof suggestion Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the bounded controller-level abandonment fix if maintainers accept the real-fs lock proof, with optional live Gateway/Discord proof to reduce the remaining transport-level uncertainty. Do we have a high-confidence way to reproduce the issue? Yes, with high confidence from source inspection and the PR's real-fs proof: current main can hold a reacquired session write lock until the wrapped compaction/hook run settles, while cleanup later contends with that same lock. I did not run local tests because this review is constrained to read-only inspection. Is this the best way to solve the issue? Yes, the patch targets the controller that owns the reacquired locks and keeps the existing takeover fence rather than adding a broad runtime fallback. Explicit compaction cancellation would be a larger design, but it is not required to fix this lock leak. Label justifications:
Acceptance criteria:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 78d226bb3b69. |
…stic stderr
Add real-file-system integration test that exercises the actual
acquireSessionWriteLock + on-disk .jsonl.lock semantics (no mocks):
- Hold the lock through a stuck withSessionWriteLock and verify a
competing acquire on the same session file fails with
SessionWriteLockTimeoutError (matches the user-reported 60s
timeout on later Discord turns).
- Run attempt cleanup, verify the lock is released, the next
competing acquire succeeds in <5ms, and the .jsonl bytes remain
intact (no transcript tear).
- Verify the new stderr diagnostic line names the abandoned owner
pid and session file — what maintainers see in
`journalctl -u openclaw` after the bug fires.
Make abandonInFlightWriteLocks await its releases so the on-disk lock
file is gone before cleanup returns; without it, concurrent acquires
race the release and bounce off "file lock stale" until the watchdog
catches up.
|
@clawsweeper re-review Added real-fs integration test (no mocks) using the actual |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper automerge |
|
🦞🔧
Draft PRs stay fix-only until GitHub marks them ready for review. Pause with Automerge progress:
|
|
ClawSweeper 🐠 reef update Thanks for the work on this. ClawSweeper did not have permission to update this branch directly, so it opened a narrow replacement PR instead. that's a branch access thing, not a knock on the contribution. Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
fish notes: model gpt-5.5, reasoning high; reviewed against c1b3143. |
Summary
Fixes #84193 — post-run Pi auto-compaction can hang while holding the session JSONL write lock. After the run's outer abort/timeout fires the wrapped
await run()body never settles, so the lock-controller'sfinallynever runs, so the lock keeps living in-process for the entiremaxHoldMswindow (defaults to runTimeout + 900s compaction grace = many minutes). Every later turn on the same Discord channel then bounces offSessionWriteLockTimeoutError: session file locked (timeout 60000ms)until a Gateway restart.This change tracks every lock reacquired through
withSessionWriteLockand abandons the still-held entries fromacquireForCleanup, so the next turn can acquire the session file immediately. The existing transcript fence still rejects later writes that would have raced a partial-compaction mutation.What changed
src/agents/pi-embedded-runner/run/attempt.session-lock.tsInFlightLockEntryand add it to a per-controllerSet.abandonInFlightWriteLocks()flips the controller into takeover state and awaits force-release of every still-held entry (idempotent via the entry'sreleasedflag).acquireForCleanup()calls it before re-acquiring, so a hung compact() can never keep the JSONL lock alive past attempt teardown. Also releases the lingeringheldLock(initial coarse lock) if it was still held when takeover was detected.journalctl -u openclawtraces with the cleanup path:[session-write-lock] abandoned N in-flight lock(s) on attempt cleanup: sessionFile=... owner=pid=... reason=stuck-compaction-or-hook.src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts(new, unit)withSessionWriteLock(() => never-resolve)and proves cleanup releases it.EmbeddedAttemptSessionTakeoverErrorfence path.src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts(new, real fs)acquireSessionWriteLockagainst a real tmpsession.jsonl(no mocks). Holds the lock through a stuck run, verifies a competingacquireSessionWriteLockfrom a separate caller fails withSessionWriteLockTimeoutErrorBEFORE cleanup, then runs cleanup and verifies the same competing acquire succeeds AFTER cleanup. Captures the diagnostic line and confirms the on-disk.jsonlbytes are not torn.Real behavior proof
Behavior addressed: post-run Pi auto-compaction holds the session JSONL write lock past the run's outer timeout, blocking every later turn for the same session with
SessionWriteLockTimeoutErroruntil Gateway restart. (#84193)Real environment tested: macOS arm64, Node v25.9.0, vitest 4.1.6 from
node scripts/run-vitest.mjs. Realnode:fslock file + realacquireSessionWriteLock+ realcreateEmbeddedAttemptSessionLockController— no mocks for the lock subsystem.Exact steps or command run after this patch:
Evidence after fix (vitest summary):
Real-fs trace from a standalone repro script (real
acquireSessionWriteLock, no vitest, no mocks — captured locally with tmp paths redacted to$TMP):This is the same shape of evidence the issue reporter would see: lock file on disk with the live Gateway PID before cleanup, the
SessionWriteLockTimeoutErrormatching the user-reported error name and pid format (pid=N), the stderr diagnostic line on abandon, then immediate acquire success on the same session file and a verified non-tornsession.jsonl. Same regression suite onupstream/mainwithout this patch fails at:Observed result after fix: cleanup releases the stuck reacquired lock immediately. The 60s acquire timeout no longer blocks the next turn. Transcript fencing still rejects further writes after abandon via the existing
EmbeddedAttemptSessionTakeoverError. Diagnostic stderr line is observable so a maintainer runningjournalctl -u openclawafter the failure will see which session file/pid was abandoned.What was not tested: live Discord + Anthropic Opus session against a running Gateway. The integration test covers the file-lock subsystem semantics that the bug actually leaks (on-disk
.jsonl.lock, the controller's reacquire path, the cleanup hook called fromattempt.ts:4832); a Discord/Opus run would only add transport plumbing on top of the same locking code, so the lock-side behavior is fully covered here.Risk addressed
ClawSweeper flagged two session-state-sensitive risks. Addressed:
acquireForCleanup(), which the embedded attempt runner only calls in its outerfinallyblock (attempt.ts:4832). At that point the run's outer abort/timeout has already fired, so any late compaction write is already on an error path. The transcript fence (assertSessionFileFence) on the next attempt's controller catches any post-abandon byte-level divergence on its firstwithSessionWriteLockcall, so a late compaction write cannot silently corrupt a future turn.lock.release()before returning, so the.jsonl.lockfile is gone on disk before cleanup returns. The integration test exercises exactly this race (competing acquire from the same process fired immediately after cleanup) and confirms it succeeds without bouncing off the file-lock-manager's stale-lock detection.Test plan
node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts— real-fs proof, 2/2 passnode scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts— unit regression, 4/4 passnode scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts— pre-existing suite, 28/28 pass (unchanged)node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.test.ts— passnode scripts/run-vitest.mjs src/agents/session-write-lock.test.ts— passnode scripts/run-oxlint.mjson touched files — 0 errors / 0 warningspnpm exec oxfmt --checkon touched files — formatted