fix(agents): reclaim session write-locks held past the holder's own maxHoldMs#89673
fix(agents): reclaim session write-locks held past the holder's own maxHoldMs#89673openperf wants to merge 1 commit into
Conversation
|
Codex review: needs maintainer review before merge. Reviewed June 3, 2026, 2:15 AM ET / 06:15 UTC. Summary PR surface: Source +3, Tests +28, Docs +3. Total +34 across 4 files. Reproducibility: yes. at source level: current main records hold-exceeded but keeps it report-only, and the PR adds a deterministic live-owner test for that path. The overnight gateway lockup itself was not reproduced live. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land this only if maintainers accept maxHoldMs as the hard cross-process reclaim deadline for unchanged live-holder locks, keeping the updated help/docs and regression test with the code change. Do we have a high-confidence way to reproduce the issue? Yes, at source level: current main records hold-exceeded but keeps it report-only, and the PR adds a deterministic live-owner test for that path. The overnight gateway lockup itself was not reproduced live. Is this the best way to solve the issue? Yes, if maintainers accept the policy: the patch is a narrow owner-boundary fix that uses fs-safe's remove-if-unchanged contract and now aligns docs/help. A separate opt-in policy would be safer only if maxHoldMs must remain watchdog-only. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against f47277871791. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +3, Tests +28, Docs +3. Total +34 across 4 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
bd601ba to
8009ced
Compare
8009ced to
46c4ccf
Compare
46c4ccf to
25db525
Compare
|
This PR looks like the right safety-net layer for live-gateway locks that have exceeded the holder's own I hit the same broader failure family again on a Discord session. The observed lock payload had a live gateway PID/starttime and One nuance from the fresh observation: some follow-up failures happened before So I think this PR is still valuable and necessary for the "past maxHold but live PID" case, but it is complementary to prompt-release / abort-release leak fixes like #89649. Together they cover:
|
Summary
maxHoldMs/staleMs; the watchdog never reclaims it, so later requests fail withsession file locked (timeout 60000ms): pid=…. The reporter's holder process was alive and busy (83% CPU) while holding the lock for 8+ hours despite the lock recordingmaxHoldMs: 1020000(17 min).ensureWatchdogStarted) is asetIntervalrunning inside the holding process. If that process is stuck in a synchronous CPU-bound loop, its event loop is blocked and the watchdog timer never fires, so the holder never self-releases.acquireSessionWriteLocktime (withrespectMaxHoldon), correctly flags ithold-exceeded(age past the holder's own recordedmaxHoldMs). ButshouldRemoveContendedLockFile(theshouldRemoveStaleLockcallback the lock library consults understaleRecovery: "remove-if-unchanged") treats bothtoo-oldandhold-exceededas report-only:REPORT_ONLY_STALE_LOCK_REASONS = {"too-old", "hold-exceeded"}. When a live holder's only stale reasons are these, the function returnsfalse(don't remove), the library throwsfile_lock_stale, and the acquire eventually fails — the lock is never reclaimed even though it is past the holder's own declared max hold. So a live-but-stuck holder pins the lock forever: it cannot self-release (event loop blocked) and contenders are forbidden from reclaiming it."hold-exceeded"fromREPORT_ONLY_STALE_LOCK_REASONS. A holder past its own recordedmaxHoldMsis overdue by its own contract and is reclaimable by a contender."too-old"(past the globalstaleMsheuristic) stays report-only — a live holder may still be working within its own declaredmaxHoldMs, so we do not steal its lock on the global heuristic alone. The acquire path'sremove-if-unchangedrecovery still skips a lock whose file has changed (e.g. a concurrent release), so only a genuinely stalled holder loses the lock.src/agents/session-write-lock.ts— drop"hold-exceeded"fromREPORT_ONLY_STALE_LOCK_REASONS(one element + an explanatory comment).src/agents/session-write-lock.test.ts— add an integration regression: a live OpenClaw-owned lock past its ownmaxHoldMs(but withinstaleMs) is now reclaimed byacquireSessionWriteLock.src/config/schema.help.tsanddocs/reference/session-management-compaction.md— document thatsession.writeLock.maxHoldMsis the hard hold limit enforced both by the holder's watchdog and by a contending reclaim (help string + reference prose only)."too-old"stays report-only, so a live holder past the globalstaleMsbut within its ownmaxHoldMsis still respected (covered by the existing "reports live OpenClaw-owned stale locks without removing them" test).cleanStaleLockFilesdoctor sweep does not passrespectMaxHoldand therefore never produceshold-exceeded; its behavior is unchanged (it still force-removes only dead/recycled/orphan locks, never a live holder's).session.writeLock.maxHoldMshelp string and its reference doc, updated to match the behavior. No plugin SDK surface, no dependency changes. The watchdog interval, timeout,staleMs, andmaxHoldMsresolution are untouched.Reproduction
maxHoldMs.maxHoldMsbut cannot reclaim it.session file lockederror; the lock persists until the gateway restarts or the file is deleted manually.Deterministic source reproduction (added as a regression test): a live OpenClaw-owned lock file with
maxHoldMs: 1000andcreatedAt30s ago, acquired with a largestaleMs. Before the fix the acquire throwsSessionWriteLockStaleError: session file lock stale (hold-exceeded): … alive=true; after the fix it reclaims the lock.Real behavior proof
Behavior addressed (#87483 — a stalled live holder's lock is reclaimable): a session write-lock held past the holder's own
maxHoldMsby a live OpenClaw process is reclaimed by a contender instead of pinning the session indefinitely.Real environment tested (Linux, Node — Vitest against the production
acquireSessionWriteLockwith a real spawned OpenClaw-recognized owner process; no live gateway): verified viasession-write-lock.test.ts, plus core type-check, lint, and format on the changed files. A live overnight gateway lockup was not reproduced.Exact steps or command run after this patch (repo-root):
node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts;node scripts/run-tsgo.mjs -p tsconfig.core.json;oxfmt --checkandoxlinton the changed files.Evidence after fix (Vitest output for the touched test file):
The new case fails on current
mainexactly as the issue describes —SessionWriteLockStaleError: session file lock stale (hold-exceeded): pid=… alive=true ageMs=30005— and passes after dropping"hold-exceeded"from the report-only set, while all 53 existing cases (including the conservative "report live too-old locks without removing them") stay green.Observed result after fix (overdue live holder reclaimed; conservative live holder respected): a live holder past its own
maxHoldMsis reclaimed; a live holder merely past the globalstaleMs(within itsmaxHoldMs) is still only reported, not removed.What was not tested (live overnight gateway lockup): the multi-hour real-gateway lock retention was not reproduced live; the proof is the deterministic module-level reproduction.
Risk / Mitigation
maxHoldMsby its own in-process watchdog (runLockWatchdogCheckforce-releases without aborting in-flight work), somaxHoldMswas already the hard hold limit before this change; the cross-process reclaim only restores that limit for the wedged case where the watchdog cannot fire.remove-if-unchangedadditionally skips a lock whose file changed.staleMsheuristic; it does not touch thresholds, the watchdog, or the doctor sweep, and the conservative live-too-oldbehavior is preserved (covered by an existing test).Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #87483