Skip to content

fix(agents): reclaim session write-locks held past the holder's own maxHoldMs#89673

Open
openperf wants to merge 1 commit into
openclaw:mainfrom
openperf:fix/87483-session-writelock-stale-reclaim
Open

fix(agents): reclaim session write-locks held past the holder's own maxHoldMs#89673
openperf wants to merge 1 commit into
openclaw:mainfrom
openperf:fix/87483-session-writelock-stale-reclaim

Conversation

@openperf

@openperf openperf commented Jun 3, 2026

Copy link
Copy Markdown
Member

Summary

  • Problem: Issue [Bug]: Session file lock not released properly by watchdog #87483 reports that a session write-lock file held by a live gateway process persists for hours past maxHoldMs/staleMs; the watchdog never reclaims it, so later requests fail with session file locked (timeout 60000ms): pid=…. The reporter's holder process was alive and busy (83% CPU) while holding the lock for 8+ hours despite the lock recording maxHoldMs: 1020000 (17 min).
  • Root Cause: There are two reclaim mechanisms and both fail for a live-but-stuck holder:
    1. The per-process watchdog (ensureWatchdogStarted) is a setInterval running inside the holding process. If that process is stuck in a synchronous CPU-bound loop, its event loop is blocked and the watchdog timer never fires, so the holder never self-releases.
    2. A contending process inspects the lock and, at acquireSessionWriteLock time (with respectMaxHold on), correctly flags it hold-exceeded (age past the holder's own recorded maxHoldMs). But shouldRemoveContendedLockFile (the shouldRemoveStaleLock callback the lock library consults under staleRecovery: "remove-if-unchanged") treats both too-old and hold-exceeded as report-only: REPORT_ONLY_STALE_LOCK_REASONS = {"too-old", "hold-exceeded"}. When a live holder's only stale reasons are these, the function returns false (don't remove), the library throws file_lock_stale, and the acquire eventually fails — the lock is never reclaimed even though it is past the holder's own declared max hold. So a live-but-stuck holder pins the lock forever: it cannot self-release (event loop blocked) and contenders are forbidden from reclaiming it.
  • Fix: Remove "hold-exceeded" from REPORT_ONLY_STALE_LOCK_REASONS. A holder past its own recorded maxHoldMs is overdue by its own contract and is reclaimable by a contender. "too-old" (past the global staleMs heuristic) stays report-only — a live holder may still be working within its own declared maxHoldMs, so we do not steal its lock on the global heuristic alone. The acquire path's remove-if-unchanged recovery still skips a lock whose file has changed (e.g. a concurrent release), so only a genuinely stalled holder loses the lock.
  • What changed:
    • src/agents/session-write-lock.ts — drop "hold-exceeded" from REPORT_ONLY_STALE_LOCK_REASONS (one element + an explanatory comment).
    • src/agents/session-write-lock.test.ts — add an integration regression: a live OpenClaw-owned lock past its own maxHoldMs (but within staleMs) is now reclaimed by acquireSessionWriteLock.
    • src/config/schema.help.ts and docs/reference/session-management-compaction.md — document that session.writeLock.maxHoldMs is the hard hold limit enforced both by the holder's watchdog and by a contending reclaim (help string + reference prose only).
  • What did NOT change (scope boundary):
    • "too-old" stays report-only, so a live holder past the global staleMs but within its own maxHoldMs is still respected (covered by the existing "reports live OpenClaw-owned stale locks without removing them" test).
    • The cleanStaleLockFiles doctor sweep does not pass respectMaxHold and therefore never produces hold-exceeded; its behavior is unchanged (it still force-removes only dead/recycled/orphan locks, never a live holder's).
    • No new config key, schema field, default, env var, or doctor migration; the only config/docs touch is the existing session.writeLock.maxHoldMs help string and its reference doc, updated to match the behavior. No plugin SDK surface, no dependency changes. The watchdog interval, timeout, staleMs, and maxHoldMs resolution are untouched.

Reproduction

  1. A long-running gateway holds a session write-lock and gets stuck in a synchronous CPU loop (event loop blocked).
  2. The holder's in-process watchdog can't fire, so the lock is never self-released past maxHoldMs.
  3. Another request tries to write the same session; it sees the lock past the holder's maxHoldMs but cannot reclaim it.
  4. Before this PR: the contender fails with a stale/session file locked error; the lock persists until the gateway restarts or the file is deleted manually.
  5. After this PR: the contender reclaims the lock (the stalled holder's overdue lock is removed-if-unchanged) and proceeds.

Deterministic source reproduction (added as a regression test): a live OpenClaw-owned lock file with maxHoldMs: 1000 and createdAt 30s ago, acquired with a large staleMs. Before the fix the acquire throws SessionWriteLockStaleError: session file lock stale (hold-exceeded): … alive=true; after the fix it reclaims the lock.

Real behavior proof

Behavior addressed (#87483 — a stalled live holder's lock is reclaimable): a session write-lock held past the holder's own maxHoldMs by a live OpenClaw process is reclaimed by a contender instead of pinning the session indefinitely.

Real environment tested (Linux, Node — Vitest against the production acquireSessionWriteLock with a real spawned OpenClaw-recognized owner process; no live gateway): verified via session-write-lock.test.ts, plus core type-check, lint, and format on the changed files. A live overnight gateway lockup was not reproduced.

Exact steps or command run after this patch (repo-root): node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts; node scripts/run-tsgo.mjs -p tsconfig.core.json; oxfmt --check and oxlint on the changed files.

Evidence after fix (Vitest output for the touched test file):

 session-write-lock.test.ts   Tests  54 passed (54)

The new case fails on current main exactly as the issue describes — SessionWriteLockStaleError: session file lock stale (hold-exceeded): pid=… alive=true ageMs=30005 — and passes after dropping "hold-exceeded" from the report-only set, while all 53 existing cases (including the conservative "report live too-old locks without removing them") stay green.

Observed result after fix (overdue live holder reclaimed; conservative live holder respected): a live holder past its own maxHoldMs is reclaimed; a live holder merely past the global staleMs (within its maxHoldMs) is still only reported, not removed.

What was not tested (live overnight gateway lockup): the multi-hour real-gateway lock retention was not reproduced live; the proof is the deterministic module-level reproduction.

Risk / Mitigation

  • Risk: reclaiming a live holder's lock could race with a holder still writing the transcript. Mitigation: no new race is introduced. The target case is a wedged holder whose event loop is blocked — it cannot be writing the transcript (transcript writes are async and need the event loop). A holder that is still making progress is already released at maxHoldMs by its own in-process watchdog (runLockWatchdogCheck force-releases without aborting in-flight work), so maxHoldMs was already the hard hold limit before this change; the cross-process reclaim only restores that limit for the wedged case where the watchdog cannot fire. remove-if-unchanged additionally skips a lock whose file changed.
  • Risk: behavior change in session-state concurrency. Mitigation: the change narrows report-only to the global staleMs heuristic; it does not touch thresholds, the watchdog, or the doctor sweep, and the conservative live-too-old behavior is preserved (covered by an existing test).

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Agents / session write-lock

Linked Issue/PR

Fixes #87483

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS labels Jun 3, 2026
@clawsweeper

clawsweeper Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 3, 2026, 2:15 AM ET / 06:15 UTC.

Summary
The PR removes hold-exceeded from report-only session write-lock stale reasons, adds a live-owner regression test, and updates maxHoldMs help/docs.

PR surface: Source +3, Tests +28, Docs +3. Total +34 across 4 files.

Reproducibility: yes. at source level: current main records hold-exceeded but keeps it report-only, and the PR adds a deterministic live-owner test for that path. The overnight gateway lockup itself was not reproduced live.

Review metrics: 1 noteworthy metric.

  • Config behavior surface: 1 existing setting behavior broadened, 0 new settings. session.writeLock.maxHoldMs changes from a documented holder-watchdog threshold to a cross-process contender-reclaim deadline, which is compatibility-sensitive before merge.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦐 gold shrimp
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Get maintainer acceptance that maxHoldMs is the hard cross-process reclaim deadline for unchanged live-holder locks.

Risk before merge

  • [P1] Merging changes session-state concurrency: a live OpenClaw holder can lose an unchanged transcript lock after its recorded maxHoldMs, so users relying on the older watchdog-only interpretation need maintainer acceptance of the new contract.
  • [P1] The overnight gateway hang was not reproduced end to end; the proof is source-level with a spawned live owner process plus the fs-safe remove-if-unchanged contract.

Maintainer options:

  1. Accept maxHoldMs as the reclaim deadline (recommended)
    Land if maintainers agree the existing maxHoldMs setting should bound both holder watchdog release and contender reclaim of unchanged overdue locks.
  2. Pause for an explicit reclaim policy
    If maxHoldMs must remain watchdog-only, pause this PR and design a separate opt-in contender-reclaim policy with upgrade docs and tests.

Next step before merge

  • [P2] The remaining action is maintainer acceptance of session-state reclaim semantics; there is no narrow automated repair left after the docs/help update.

Security
Cleared: No concrete security or supply-chain regression was found; the diff does not change dependencies, workflows, secrets, package resolution, or downloaded code paths.

Review details

Best possible solution:

Land this only if maintainers accept maxHoldMs as the hard cross-process reclaim deadline for unchanged live-holder locks, keeping the updated help/docs and regression test with the code change.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: current main records hold-exceeded but keeps it report-only, and the PR adds a deterministic live-owner test for that path. The overnight gateway lockup itself was not reproduced live.

Is this the best way to solve the issue?

Yes, if maintainers accept the policy: the patch is a narrow owner-boundary fix that uses fs-safe's remove-if-unchanged contract and now aligns docs/help. A separate opt-in policy would be safer only if maxHoldMs must remain watchdog-only.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f47277871791.

Label changes

Label changes:

  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external-contributor real-proof gate does not apply to this member-authored PR; the body includes deterministic module-level output but no live overnight gateway proof.
  • remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P2: This is a bounded session-state reliability fix with real user impact, but it is not a crash loop, data-loss emergency, or security issue.
  • merge-risk: 🚨 compatibility: The PR changes the operational meaning of an existing session.writeLock.maxHoldMs setting for upgraded users who tune lock behavior.
  • merge-risk: 🚨 session-state: The PR changes when another process may remove a live holder's session transcript lock.
  • rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦐 gold shrimp and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external-contributor real-proof gate does not apply to this member-authored PR; the body includes deterministic module-level output but no live overnight gateway proof.
Evidence reviewed

PR surface:

Source +3, Tests +28, Docs +3. Total +34 across 4 files.

View PR surface stats
Area Files Added Removed Net
Source 2 5 2 +3
Tests 1 28 0 +28
Docs 1 5 2 +3
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 4 38 4 +34

What I checked:

  • Repository policy read: Read root AGENTS.md plus scoped src/agents and docs guidance; session-state/config compatibility guidance affected the review. (AGENTS.md:1, f47277871791)
  • Current main report-only behavior: Current main treats both too-old and hold-exceeded as report-only stale reasons, so shouldRemoveContendedLockFile returns false when those are the only stale reasons. (src/agents/session-write-lock.ts:49, f47277871791)
  • Current main hold-exceeded detection: Current main adds hold-exceeded when respectMaxHold is true and the observed lock age exceeds the holder-recorded maxHoldMs. (src/agents/session-write-lock.ts:531, f47277871791)
  • PR policy change: The PR head removes hold-exceeded from REPORT_ONLY_STALE_LOCK_REASONS, making holder maxHoldMs expiry eligible for removal through the existing remove-if-unchanged path. (src/agents/session-write-lock.ts:52, 25db525072d7)
  • Regression coverage on PR head: The new test creates a live OpenClaw-recognized owner whose recorded maxHoldMs is exceeded while staleMs is not, then expects the contender to acquire the lock. (src/agents/session-write-lock.test.ts:470, 25db525072d7)
  • Docs and help aligned on PR head: The PR updates both schema help and public session docs to describe maxHoldMs as a hard hold limit enforced by the holder watchdog and by contender reclaim when the lock file is unchanged. (src/config/schema.help.ts:1653, 25db525072d7)

Likely related people:

  • @Grynn: Authored the stale session lock watchdog change that introduced the maxHoldMs release contract and central session-write-lock behavior. (role: feature history contributor; confidence: medium; commits: e91a5b021648; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.e2e.test.ts)
  • @vincentkoc: Recently hardened recycled-PID session lock recovery in the same lock ownership and stale-reclaim area. (role: recent adjacent contributor; confidence: medium; commits: 5a2200b28003; files: src/agents/session-write-lock.ts, src/agents/session-write-lock.test.ts, src/shared/pid-alive.ts)
  • @giodl73-repo: Recent squash history touched the session management docs that define the user-facing lock configuration contract. (role: recent docs/session area contributor; confidence: low; commits: 1d3cfc4b0168; files: docs/reference/session-management-compaction.md, docs/concepts/session.md)
  • @steipete: Appears as committer on the watchdog history and author on nearby agent/session runtime refactors, making this a plausible routing candidate for maintainer policy review. (role: merger and adjacent area contributor; confidence: low; commits: e91a5b021648, bb46b79d3c14; files: src/agents/session-write-lock.ts, src/agents/embedded-agent-runner/compact.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels Jun 3, 2026
@openperf openperf force-pushed the fix/87483-session-writelock-stale-reclaim branch from bd601ba to 8009ced Compare June 3, 2026 05:15
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels Jun 3, 2026
@openperf openperf force-pushed the fix/87483-session-writelock-stale-reclaim branch from 8009ced to 46c4ccf Compare June 3, 2026 05:53
@clawsweeper clawsweeper Bot added the merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. label Jun 3, 2026
@openperf openperf force-pushed the fix/87483-session-writelock-stale-reclaim branch from 46c4ccf to 25db525 Compare June 3, 2026 06:08
@openclaw-barnacle openclaw-barnacle Bot added the docs Improvements or additions to documentation label Jun 3, 2026
@clawsweeper clawsweeper Bot added status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 3, 2026
@anyech

anyech commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

This PR looks like the right safety-net layer for live-gateway locks that have exceeded the holder's own maxHoldMs.

I hit the same broader failure family again on a Discord session. The observed lock payload had a live gateway PID/starttime and maxHoldMs: 1020000; later same-session turns failed with SessionWriteLockTimeoutError while the gateway process was still alive.

One nuance from the fresh observation: some follow-up failures happened before maxHoldMs elapsed, immediately after:

stalled session ... activeTool=memory_search
stuck session recovery ... action=abort_embedded_run aborted=true drained=true
embedded abort settle timed out ... timeoutMs=2000
SessionWriteLockTimeoutError ... pid=<live-gateway-pid> .../<session>.jsonl.lock

So I think this PR is still valuable and necessary for the "past maxHold but live PID" case, but it is complementary to prompt-release / abort-release leak fixes like #89649. Together they cover:

  • release the lock promptly on abort/prompt-release cleanup when possible;
  • if that fails and the holder exceeds its own recorded maxHold, let a contender reclaim instead of wedging indefinitely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. size: XS status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Session file lock not released properly by watchdog

2 participants