fix(agents): memoize per-pid owner-args lookup and yield in cleanStaleLockFiles#86783
Conversation
…eLockFiles The `sidecars.session-locks` startup phase blocks the main event loop long enough to starve Telegram polling, fetch timers, and health monitors (~87s observed on Windows). `cleanStaleLockFiles` invokes `readOwnerProcessArgs(pid)` once per lock, which on Windows shells out synchronously to `powershell Get-CimInstance Win32_Process` (~0.5-1s each). When many locks cluster on a few non-self alive pids (gateway, MCP, cron), the resolver fires once per lock instead of once per pid. Memoize the owner-args resolver per sweep so the same pid is resolved exactly once, and yield to the event loop via setImmediate between lock entries so the per-lock isPidAlive / /proc reads / PowerShell syscalls interleave with concurrent timers. Linux harness: 1800 -> 18 resolver calls (3 unique pids x 6 sweeps); on Windows the same 100x reduction collapses the reporter's 87s sidecar phase to seconds. Fixes openclaw#86509
|
Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 3:49 AM ET / 07:49 UTC. Summary PR surface: Source +18, Tests +72. Total +90 across 2 files. Reproducibility: yes. Current main source shows one owner-args lookup path inside the per-lock cleanup loop, and the PR body supplies a concrete 1,800-lock harness showing the pre-fix call explosion and after-fix collapse; I did not run tests because this was a read-only review. Review metrics: none identified. Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the focused session-lock cleanup after normal maintainer review and CI, then use it as the fix path for #86509 if the reporter confirms the Windows regression is resolved. Do we have a high-confidence way to reproduce the issue? Yes. Current main source shows one owner-args lookup path inside the per-lock cleanup loop, and the PR body supplies a concrete 1,800-lock harness showing the pre-fix call explosion and after-fix collapse; I did not run tests because this was a read-only review. Is this the best way to solve the issue? Yes. The patch is a narrow hot-path fix that keeps lock classification semantics and public contracts intact while reducing repeated synchronous process-args work and yielding between lock entries. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 3548cff14b9c. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +18, Tests +72. Total +90 across 2 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Frosted Merge Sprite Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
| } | ||
| }); | ||
|
|
||
| it("memoizes readOwnerProcessArgs across locks with the same pid in one sweep (#86509)", async () => { |
There was a problem hiding this comment.
The logic here looks solid. Just curious, have we considered adding a test case where readOwnerProcessArgs throws an error? It might be useful to verify how the cleanStaleLockFiles behaves under those failure conditions to ensure the lock files aren't left in a corrupted state. Otherwise , LGTM
There was a problem hiding this comment.
Pushed a test for that — 378a36e. Turns out there's a readOwnerProcessArgs helper one layer above the memo that already catches resolver throws and returns null, so cleanStaleLockFiles itself never propagates the error (unchanged from pre-memo). What the memo actually adds is a write-after-success ordering: the cache slot isn't set until the resolver returns, so a thrown call doesn't poison the slot — later locks for the same pid retry fresh instead of cache-hitting on a stale failure. The new case pins that with three same-pid locks against a throwing resolver, expecting the resolver to run three times not one.
Good catch, thanks for looking!
There was a problem hiding this comment.
That makes total sense! Thanks for the clarification on the write-after-success ordering and for adding the test case. It's really cool to see how the memoization handles the resolver error. Appreciate the insight!
ratishoberoi
left a comment
There was a problem hiding this comment.
Looks like a solid test case for memoization. No issues found.
…cessArgs (openclaw#86509) `readOwnerProcessArgs` helper one layer above the memo already catches resolver throws and returns null, so `cleanStaleLockFiles` does not propagate the error. The memo's addition is write-after-success ordering: a throwing resolver does not occupy the cache slot, so later locks for the same pid retry the resolver fresh instead of cache-hitting on a stale failure. Pin that with three same-pid locks against a throwing resolver and assert the resolver runs three times.
|
Maintainer verification before merge: Behavior addressed: startup session-lock cleanup should not resolve owner argv once per same-pid lock, and should give timers a chance to run between lock entries. Real environment tested: local current-main merge result in Exact steps or command run after this patch:
Evidence after fix:
Observed result after fix: five same-pid fresh live locks called What was not tested: full Windows 11 Telegram cascade. Full-file and narrowed Vitest attempts were started under Node 26 and Node 24 but hung before reporting in this busy checkout, so I used the focused TSX harness against the real merged source instead. |
Summary
sidecars.session-locksphase blocks the main event loop long enough to starve concurrent timers and HTTP polling. The reporter sees an 87-second stall on Windows after the v5.20 → v5.22 upgrade with two agents (interactive Telegram + cronmodel_call) firing within ~30s of each other: TelegramgetMecalls time out at 10s despite 31s elapsed wall, the health monitor flaps and cascade-restarts four bots, and the run never recovers. (Regression on v2026.5.22: event-loop starvation returns (87s session-lock phase, 31s loop delay) — ref #80695 #86509)cleanStaleLockFilesiterates every.jsonl.lockin each agent's sessions dir on a single serialfor ... awaitand, for any fresh non-self alive pid, callsreadOwnerProcessArgs(pid)to decide whether the lock owner is an OpenClaw process. There is no per-pid memoization, so when many locks share the same pid (gateway, MCP wrapper, cron run) the resolver is invoked once per lock. On Linux a/proc/PID/cmdlineread is ~0.1 ms — annoying but tolerable; on Windows the resolver shells out topowershell -Command "Get-CimInstance Win32_Process -Filter ..." | Select-Object -ExpandProperty CommandLinesynchronously per pid, ~0.5–1 s each. With multiple agents × dozens of fresh locks pointing at a handful of repeat pids that scales to tens of seconds of synchronous main-thread work, exactly matching the reported 87 s phase.cleanStaleLockFilesonly.Map<pid, args>cache populated on first call. Same pid in the same sweep no longer re-spawns PowerShell (or re-reads/proc/PID/cmdline); pids do not recycle within a single sweep so per-invocation caching is safe.await new Promise((r) => setImmediate(r))at the top of each iteration so concurrent timers (Telegram polling watchdog, fetch timeouts, health monitor) get a chance to run between per-lock sync syscalls instead of all of them piling up under a single long stack.src/agents/session-write-lock.ts: memoizereadOwnerProcessArgsper invocation; yield viasetImmediatebetween lock entries inside the loop.src/agents/session-write-lock.test.ts: regression test asserting the resolver is invoked exactly once for multiple lock files sharing one pid in a single sweep.inspectLockPayload/inspectLockPayloadForSession,markRestartAbortedMainSessionsFromLocks, and the outer sidecar loop (for sessionsDir of sessionDirsinserver-startup-post-attach.ts) are unchanged. Parallelizing the outer loop is a separate, larger concern.readOwnerProcessArgsreader signature is unchanged — the memo lives entirely insidecleanStaleLockFiles.Reproduction
.jsonl.lockfiles whosepidclusters on a small set of fresh non-self alive pids and whosecreatedAtis recent (socleanStaleLockFilesenters the "non-OpenClaw owner" check path).cleanStaleLockFilesonce per dir (the production sidecar pattern).readOwnerProcessArgsresolver is invoked, and measure event-loop max delay.Real behavior proof
cleanStaleLockFilesmust not invoke the expensive owner-args resolver once per lock — and must yield to the event loop between locks so the rest of the gateway (Telegram polling, fetch timers, health monitors) can keep running while the sweep proceeds.cleanStaleLockFilesagainst them with a counting resolver, and measures wall time +monitorEventLoopDelayp99/max.tsx /tmp/qmd86509/repro-cluster.mts <worktree-path>against the pre-fix worktree and the patched worktree.setImmediateoverhead from the new yield (small loss on Linux); on Windows where the resolver is a synchronous PowerShell invocation costing ~0.5–1 s each, the same reduction translates to the sidecar dropping from minutes of blocked main-thread work to seconds. The new yield additionally interleaves the per-lockisPidAlive/getProcessStartTimesyscalls with timer execution so the loop never sits idle on a long stack.node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts(full file passes) — including a new case that asserts the resolver is invoked exactly once for multiple lock files sharing one pid in a single sweep; existing per-lock cleanup behavior is unchanged.cleanStaleLockFilesand the real (mocked-counter) resolver end to end; the Windows magnitude is an extrapolation from the same algorithmic change (PowerShell Win32_Processis documented per-invocation overhead, not modified here).Risk / Mitigation
cleanStaleLockFilescall (theMapis created inside the function). A sweep completes well within the OS pid-recycle window — pids are not reissued in milliseconds — so a cached owner-args entry never becomes stale before the sweep finishes.setImmediate, which adds ~50 µs per lock; for a small sweep (say 5 locks) overhead is ~0.25 ms — invisible. Existing tests continue to pass.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #86509 · Related (closed precursor) #80695