Skip to content

fix(agents): abandon hung in-flight write lock on attempt cleanup (#84193)#84220

Closed
yetval wants to merge 2 commits into
openclaw:mainfrom
yetval:fix/84193-compaction-lock-leak
Closed

fix(agents): abandon hung in-flight write lock on attempt cleanup (#84193)#84220
yetval wants to merge 2 commits into
openclaw:mainfrom
yetval:fix/84193-compaction-lock-leak

Conversation

@yetval

@yetval yetval commented May 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #84193 — post-run Pi auto-compaction can hang while holding the session JSONL write lock. After the run's outer abort/timeout fires the wrapped await run() body never settles, so the lock-controller's finally never runs, so the lock keeps living in-process for the entire maxHoldMs window (defaults to runTimeout + 900s compaction grace = many minutes). Every later turn on the same Discord channel then bounces off SessionWriteLockTimeoutError: session file locked (timeout 60000ms) until a Gateway restart.

This change tracks every lock reacquired through withSessionWriteLock and abandons the still-held entries from acquireForCleanup, so the next turn can acquire the session file immediately. The existing transcript fence still rejects later writes that would have raced a partial-compaction mutation.

What changed

  • src/agents/pi-embedded-runner/run/attempt.session-lock.ts
    • Wrap each reacquired write lock in a small InFlightLockEntry and add it to a per-controller Set.
    • New abandonInFlightWriteLocks() flips the controller into takeover state and awaits force-release of every still-held entry (idempotent via the entry's released flag).
    • acquireForCleanup() calls it before re-acquiring, so a hung compact() can never keep the JSONL lock alive past attempt teardown. Also releases the lingering heldLock (initial coarse lock) if it was still held when takeover was detected.
    • Emit a stderr diagnostic line on abandon so maintainers can correlate journalctl -u openclaw traces with the cleanup path: [session-write-lock] abandoned N in-flight lock(s) on attempt cleanup: sessionFile=... owner=pid=... reason=stuck-compaction-or-hook.
  • src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts (new, unit)
    • Reproduces the leak with a stuck withSessionWriteLock(() => never-resolve) and proves cleanup releases it.
    • Asserts further writes after the abandon are rejected via the existing EmbeddedAttemptSessionTakeoverError fence path.
  • src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts (new, real fs)
    • Uses the real acquireSessionWriteLock against a real tmp session.jsonl (no mocks). Holds the lock through a stuck run, verifies a competing acquireSessionWriteLock from a separate caller fails with SessionWriteLockTimeoutError BEFORE cleanup, then runs cleanup and verifies the same competing acquire succeeds AFTER cleanup. Captures the diagnostic line and confirms the on-disk .jsonl bytes are not torn.

Real behavior proof

Behavior addressed: post-run Pi auto-compaction holds the session JSONL write lock past the run's outer timeout, blocking every later turn for the same session with SessionWriteLockTimeoutError until Gateway restart. (#84193)

Real environment tested: macOS arm64, Node v25.9.0, vitest 4.1.6 from node scripts/run-vitest.mjs. Real node:fs lock file + real acquireSessionWriteLock + real createEmbeddedAttemptSessionLockController — no mocks for the lock subsystem.

Exact steps or command run after this patch:

node scripts/run-vitest.mjs \
  src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts \
  src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts \
  src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts \
  src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.test.ts \
  src/agents/session-write-lock.test.ts

Evidence after fix (vitest summary):

 Test Files  10 passed (10)
      Tests  114 passed (114)
   Duration  1.11s

Real-fs trace from a standalone repro script (real acquireSessionWriteLock, no vitest, no mocks — captured locally with tmp paths redacted to $TMP):

[trace] sessionFile=$TMP/repro-84193-pVdd3y/session.jsonl
[trace] process.pid=72484
[trace] BEFORE cleanup: lock file owner=pid=72484
[trace] BEFORE cleanup: competing acquire failed after 1503ms with SessionWriteLockTimeoutError: session file locked (timeout 1500ms): pid=72484 $TMP/repro-84193-pVdd3y/session.jsonl.lock
[session-write-lock] abandoned 1 in-flight lock(s) on attempt cleanup: sessionFile=$TMP/repro-84193-pVdd3y/session.jsonl owner=pid=72484 reason=stuck-compaction-or-hook
[trace] AFTER cleanup:  competing acquire succeeded in 0ms
[trace] session.jsonl bytes intact: "{\"type\":\"session\"}\n"

This is the same shape of evidence the issue reporter would see: lock file on disk with the live Gateway PID before cleanup, the SessionWriteLockTimeoutError matching the user-reported error name and pid format (pid=N), the stderr diagnostic line on abandon, then immediate acquire success on the same session file and a verified non-torn session.jsonl. Same regression suite on upstream/main without this patch fails at:

embedded attempt session lock — stuck auto-compaction (#84193)
  ✗ abandons in-flight write lock when cleanup runs while a hung compact() still owns it
    expected "vi.fn()" to be called 1 times, but got 0 times
  ✗ rejects further withSessionWriteLock calls after abandoning a hung in-flight lock
    expected [Function] to throw error matching /session file changed/ but got
    'Cannot read properties of undefined (reading 'release')'

Observed result after fix: cleanup releases the stuck reacquired lock immediately. The 60s acquire timeout no longer blocks the next turn. Transcript fencing still rejects further writes after abandon via the existing EmbeddedAttemptSessionTakeoverError. Diagnostic stderr line is observable so a maintainer running journalctl -u openclaw after the failure will see which session file/pid was abandoned.

What was not tested: live Discord + Anthropic Opus session against a running Gateway. The integration test covers the file-lock subsystem semantics that the bug actually leaks (on-disk .jsonl.lock, the controller's reacquire path, the cleanup hook called from attempt.ts:4832); a Discord/Opus run would only add transport plumbing on top of the same locking code, so the lock-side behavior is fully covered here.

Risk addressed

ClawSweeper flagged two session-state-sensitive risks. Addressed:

  • Force-releasing in-flight compaction lock. Abandon only fires from acquireForCleanup(), which the embedded attempt runner only calls in its outer finally block (attempt.ts:4832). At that point the run's outer abort/timeout has already fired, so any late compaction write is already on an error path. The transcript fence (assertSessionFileFence) on the next attempt's controller catches any post-abandon byte-level divergence on its first withSessionWriteLock call, so a late compaction write cannot silently corrupt a future turn.
  • Race with concurrent in-process acquire. The abandon now awaits each lock.release() before returning, so the .jsonl.lock file is gone on disk before cleanup returns. The integration test exercises exactly this race (competing acquire from the same process fired immediately after cleanup) and confirms it succeeds without bouncing off the file-lock-manager's stale-lock detection.

Test plan

  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts — real-fs proof, 2/2 pass
  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts — unit regression, 4/4 pass
  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts — pre-existing suite, 28/28 pass (unchanged)
  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.test.ts — pass
  • node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts — pass
  • node scripts/run-oxlint.mjs on touched files — 0 errors / 0 warnings
  • pnpm exec oxfmt --check on touched files — formatted
  • Live Discord/Opus auto-compaction reproduction against a running Gateway (left for maintainer / Crabbox / Mantis)

Pi auto-compaction wraps `session.compact()` under the session JSONL
write lock via `installSessionExternalHookWriteLock`. When the compact()
call hangs (post-run, after the outer abort/timeout has already fired),
the reacquired lock is held by an `await run()` body that never settles,
so the finally block that releases it never executes either.

Until the session-write-lock watchdog forces a release at `maxHoldMs`
(default runTimeout + compaction grace = many minutes), every later turn
on the same session bounces off `SessionWriteLockTimeoutError` at the
60s acquire timeout. The only observed recovery was a Gateway restart.

Track each reacquired write lock and abandon any still-held entries
when `acquireForCleanup` runs. The transcript fence already guards
against partial-write divergence on the next acquire, and the existing
finally path stays correct (release is idempotent via the entry flag).

Refs openclaw#84193
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S proof: supplied External PR includes structured after-fix real behavior proof. labels May 19, 2026
@clawsweeper

clawsweeper Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR tracks embedded-run session write locks reacquired after prompt release, abandons still-held in-flight locks during attempt cleanup, emits a diagnostic, and adds unit plus real-fs regression coverage for the stuck compaction lock path.

Reproducibility: yes. with high confidence from source inspection and the PR's real-fs proof: current main can hold a reacquired session write lock until the wrapped compaction/hook run settles, while cleanup later contends with that same lock. I did not run local tests because this review is constrained to read-only inspection.

PR rating
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🐚 platinum hermit
Summary: The PR is a focused, well-tested fix with sufficient lock-subsystem proof, while live transport proof remains a useful risk reducer for maintainers.

Rank-up moves:

  • Run or request a redacted live Discord/Gateway proof if maintainers want end-to-end assurance beyond the real-fs lock proof.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

PR egg
✨ Hatched: 🥚 common Sunspot Shellbean

        .--^^^^--.           
     .-'  o    o  '-.        
    /       \__/      \      
   |    /\  ____  /\   |     
   |   /  \/____\/  \  |     
    \  \_.------._/  /       
     '._  `----'  _.'        
        '-.____.-'           
       _/|_|  |_|\_          
      /__|      |__\         
       .-----------.         
      '-------------'        

Rarity: 🥚 common.
Trait: sniffs out flaky tests.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Sunspot Shellbean in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • How to hatch it: reach status: 👀 ready for maintainer look or status: 🚀 automerge armed; that usually means sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

Real behavior proof
Sufficient (live_output): The PR body now includes copied real-fs terminal output and an integration test using the actual session lock implementation, showing before-fix contention and after-cleanup acquire success with redacted temp paths.

Mantis proof suggestion
A visible live transport proof would materially reduce confidence risk because the reported user impact is a Discord channel recovering after a stuck compaction lock. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

visual task: capture a redacted Gateway/Discord run where a timed-out auto-compaction lock is abandoned and a follow-up same-channel turn replies instead of SessionWriteLockTimeoutError.

Risk before merge
Why this matters: - The patch intentionally releases a lock while the original compaction or hook promise may still be unresolved; it relies on takeover fencing and skipped cleanup flushing to prevent stale transcript writes from being trusted later.

  • The supplied proof exercises the real session lock file path, but not a live Discord/Gateway run; maintainers may still want end-to-end transport proof before merging a session-state-sensitive fix.

Maintainer options:

  1. Accept the lock-subsystem proof (recommended)
    Merge after maintainer review if the real-fs lock proof is considered sufficient coverage for the reported Discord failure path.
  2. Request live Gateway proof
    Ask for a redacted live Discord/Gateway or equivalent Mantis proof showing a timed-out compaction lock is abandoned and the next same-channel turn replies.
  3. Pause for cancellation design
    Pause this PR if maintainers decide force-releasing in-flight compaction is too risky and prefer explicit compaction cancellation or stronger transcript handoff.

Next step before merge
There is no narrow automated repair to queue; maintainers should decide whether the real-fs lock proof is enough or whether to require live Discord/Gateway proof before merge.

Security
Cleared: The diff does not add dependencies, workflow execution, package scripts, or credential handling; the new diagnostic logs local session path and pid similarly to existing lock errors.

Review details

Best possible solution:

Land the bounded controller-level abandonment fix if maintainers accept the real-fs lock proof, with optional live Gateway/Discord proof to reduce the remaining transport-level uncertainty.

Do we have a high-confidence way to reproduce the issue?

Yes, with high confidence from source inspection and the PR's real-fs proof: current main can hold a reacquired session write lock until the wrapped compaction/hook run settles, while cleanup later contends with that same lock. I did not run local tests because this review is constrained to read-only inspection.

Is this the best way to solve the issue?

Yes, the patch targets the controller that owns the reacquired locks and keeps the existing takeover fence rather than adding a broad runtime fallback. Explicit compaction cancellation would be a larger design, but it is not required to fix this lock leak.

Label justifications:

  • P1: The PR targets a regression that can make an active agent/channel session unusable until Gateway restart.
  • merge-risk: 🚨 session-state: The patch changes lock ownership and takeover behavior around session JSONL writes after a timed-out compaction path.
  • merge-risk: 🚨 message-delivery: If the fix is wrong, later turns in the same channel can still fail before the agent replies or can race transcript state.

Acceptance criteria:

  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts src/agents/pi-embedded-runner/run/compaction-retry-aggregate-timeout.test.ts src/agents/session-write-lock.test.ts
  • node scripts/run-oxlint.mjs on touched files
  • pnpm exec oxfmt --check on touched files
  • Optional live Gateway/Discord proof for the auto-compaction timeout recovery path

What I checked:

  • Current main leak path: On current main, reacquired post-prompt locks are released only from the withSessionWriteLock finally block after the wrapped run settles, while acquireForCleanup later tries to reacquire the same session lock and only falls back after a timeout. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:315, 78d226bb3b69)
  • Cleanup caller: The embedded attempt outer cleanup path calls acquireForCleanup before cleanupEmbeddedAttemptResources and skips stale session flushing when takeover is detected, which is the right owner boundary for this repair. (src/agents/pi-embedded-runner/run/attempt.ts:4832, 78d226bb3b69)
  • PR implementation: The PR adds in-flight lock entries, drains and releases them before cleanup reacquisition, marks takeover, and returns the existing no-op cleanup lock after abandoning the stuck lock. (src/agents/pi-embedded-runner/run/attempt.session-lock.ts:295, ba40072d870d)
  • Real-fs regression proof: The new integration test uses the real acquireSessionWriteLock against a real temporary session.jsonl, confirms a competing acquire times out before cleanup, then confirms cleanup releases the lock and a competing acquire succeeds afterward without changing the JSONL bytes. (src/agents/pi-embedded-runner/run/attempt.session-lock.compaction-leak.integration.test.ts:37, ba40072d870d)
  • PR proof text: The PR body includes copied live terminal output from a standalone real-fs repro showing the pre-cleanup pid-owned lock, SessionWriteLockTimeoutError, the abandon diagnostic, immediate post-cleanup acquire success, and intact session bytes. (ba40072d870d)
  • Supply-chain scope: The diff touches only three TypeScript source/test files under the embedded runner session-lock surface and does not modify dependencies, workflows, package metadata, or release scripts. (ba40072d870d)

Likely related people:

  • Gio Della-Libera: Git blame and file history show the current embedded attempt session-lock controller, cleanup boundary, and session-write-lock implementation were introduced in the central migration/import commit. (role: introduced current implementation; confidence: high; commits: 8eb0a1777f08; files: src/agents/pi-embedded-runner/run/attempt.session-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/agents/session-write-lock.ts)
  • vincentkoc: Recent history shows related session lock hardening and session-lock listener fixes in the same agents/session-lock area. (role: adjacent session-lock contributor; confidence: medium; commits: 5a2200b28003, f00f0a95968f; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts)
  • steipete: History shows nearby embedded runner and session write-lock release refactors across the affected agents surface. (role: adjacent embedded-runner contributor; confidence: medium; commits: f4782e1e730e, 761b71e268a7; files: src/agents/session-write-lock.ts, src/agents/pi-embedded-runner/run/attempt.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 78d226bb3b69.

@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 19, 2026
…stic stderr

Add real-file-system integration test that exercises the actual
acquireSessionWriteLock + on-disk .jsonl.lock semantics (no mocks):

  - Hold the lock through a stuck withSessionWriteLock and verify a
    competing acquire on the same session file fails with
    SessionWriteLockTimeoutError (matches the user-reported 60s
    timeout on later Discord turns).
  - Run attempt cleanup, verify the lock is released, the next
    competing acquire succeeds in <5ms, and the .jsonl bytes remain
    intact (no transcript tear).
  - Verify the new stderr diagnostic line names the abandoned owner
    pid and session file — what maintainers see in
    `journalctl -u openclaw` after the bug fires.

Make abandonInFlightWriteLocks await its releases so the on-disk lock
file is gone before cleanup returns; without it, concurrent acquires
race the release and bounce off "file lock stale" until the watchdog
catches up.
@yetval

yetval commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Added real-fs integration test (no mocks) using the actual acquireSessionWriteLock against a real .jsonl.lock file, plus a stderr diagnostic line on abandon. The PR body now includes a live trace showing the lock file with this process's pid before cleanup, the SessionWriteLockTimeoutError matching the user-reported format, the abandon diagnostic line, and the immediate post-cleanup acquire succeeding without a torn transcript. Risk concerns from the previous review addressed inline.

@clawsweeper

clawsweeper Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 19, 2026
@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

@clawsweeper clawsweeper Bot added the clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge label May 19, 2026
@clawsweeper

clawsweeper Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

🦞🔧
ClawSweeper automerge is enabled.

Draft PRs stay fix-only until GitHub marks them ready for review. Pause with /clawsweeper stop.

Automerge progress:

  • 2026-05-19 21:08:14 UTC review queued ba40072d870d (queued)

@clawsweeper

clawsweeper Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper 🐠 reef update

Thanks for the work on this. ClawSweeper did not have permission to update this branch directly, so it opened a narrow replacement PR instead. that's a branch access thing, not a knock on the contribution.

Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
Replacement PR: #84353
Why close: this run explicitly closes the superseded source PR after the credited replacement PR is open, so review continues in one place.
Closing this one because the run was configured to close superseded source PRs after opening the replacement.
Attribution stays attached; the replacement just gives the fix a writable branch.
Co-author credit kept:

fish notes: model gpt-5.5, reasoning high; reviewed against c1b3143.

@clawsweeper clawsweeper Bot closed this May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Auto-compaction leaves session JSONL write lock held after timeout, blocking all later Discord turns

2 participants