Skip to content

fix(agents): memoize per-pid owner-args lookup and yield in cleanStaleLockFiles#86783

Merged
steipete merged 2 commits into
openclaw:mainfrom
openperf:fix/86509-session-lock-sidecar-yield-and-memo
May 26, 2026
Merged

fix(agents): memoize per-pid owner-args lookup and yield in cleanStaleLockFiles#86783
steipete merged 2 commits into
openclaw:mainfrom
openperf:fix/86509-session-lock-sidecar-yield-and-memo

Conversation

@openperf

Copy link
Copy Markdown
Member

Summary

  • Problem: On gateway startup, the sidecars.session-locks phase blocks the main event loop long enough to starve concurrent timers and HTTP polling. The reporter sees an 87-second stall on Windows after the v5.20 → v5.22 upgrade with two agents (interactive Telegram + cron model_call) firing within ~30s of each other: Telegram getMe calls time out at 10s despite 31s elapsed wall, the health monitor flaps and cascade-restarts four bots, and the run never recovers. (Regression on v2026.5.22: event-loop starvation returns (87s session-lock phase, 31s loop delay) — ref #80695 #86509)
  • Root Cause: cleanStaleLockFiles iterates every .jsonl.lock in each agent's sessions dir on a single serial for ... await and, for any fresh non-self alive pid, calls readOwnerProcessArgs(pid) to decide whether the lock owner is an OpenClaw process. There is no per-pid memoization, so when many locks share the same pid (gateway, MCP wrapper, cron run) the resolver is invoked once per lock. On Linux a /proc/PID/cmdline read is ~0.1 ms — annoying but tolerable; on Windows the resolver shells out to powershell -Command "Get-CimInstance Win32_Process -Filter ..." | Select-Object -ExpandProperty CommandLine synchronously per pid, ~0.5–1 s each. With multiple agents × dozens of fresh locks pointing at a handful of repeat pids that scales to tens of seconds of synchronous main-thread work, exactly matching the reported 87 s phase.
  • Fix: Two surgical changes in cleanStaleLockFiles only.
    • Per-sweep memo: wrap the resolver with a Map<pid, args> cache populated on first call. Same pid in the same sweep no longer re-spawns PowerShell (or re-reads /proc/PID/cmdline); pids do not recycle within a single sweep so per-invocation caching is safe.
    • Yield to the event loop between locks: await new Promise((r) => setImmediate(r)) at the top of each iteration so concurrent timers (Telegram polling watchdog, fetch timeouts, health monitor) get a chance to run between per-lock sync syscalls instead of all of them piling up under a single long stack.
  • What changed:
    • src/agents/session-write-lock.ts: memoize readOwnerProcessArgs per invocation; yield via setImmediate between lock entries inside the loop.
    • src/agents/session-write-lock.test.ts: regression test asserting the resolver is invoked exactly once for multiple lock files sharing one pid in a single sweep.
  • What did NOT change (scope boundary):
    • The lock-staleness rules, file removal semantics, inspectLockPayload/inspectLockPayloadForSession, markRestartAbortedMainSessionsFromLocks, and the outer sidecar loop (for sessionsDir of sessionDirs in server-startup-post-attach.ts) are unchanged. Parallelizing the outer loop is a separate, larger concern.
    • The general readOwnerProcessArgs reader signature is unchanged — the memo lives entirely inside cleanStaleLockFiles.
    • No config keys, protocol, schema, or default behavior changes.

Reproduction

  1. Six agent sessions dirs, each populated with many .jsonl.lock files whose pid clusters on a small set of fresh non-self alive pids and whose createdAt is recent (so cleanStaleLockFiles enters the "non-OpenClaw owner" check path).
  2. Run cleanStaleLockFiles once per dir (the production sidecar pattern).
  3. Count how many times the (mock or real) readOwnerProcessArgs resolver is invoked, and measure event-loop max delay.
  • Before: 1800 resolver calls for 1800 locks (one per lock).
  • After: 18 resolver calls (3 unique pids × 6 sweeps), with the same set of locks cleaned.

Real behavior proof

  • Behavior addressed (Regression on v2026.5.22: event-loop starvation returns (87s session-lock phase, 31s loop delay) — ref #80695 #86509): when a sidecar sweep encounters many locks that share a small set of fresh non-self pids, cleanStaleLockFiles must not invoke the expensive owner-args resolver once per lock — and must yield to the event loop between locks so the rest of the gateway (Telegram polling, fetch timers, health monitors) can keep running while the sweep proceeds.
  • Real environment tested (Linux, Node 22): a harness builds 6 agent dirs × 300 lock files each (1800 total) pointing at 3 long-lived sleep PIDs, drives the real cleanStaleLockFiles against them with a counting resolver, and measures wall time + monitorEventLoopDelay p99/max.
  • Exact steps or command run after this patch: tsx /tmp/qmd86509/repro-cluster.mts <worktree-path> against the pre-fix worktree and the patched worktree.
  • Evidence after fix (verbatim repro output):
############ BEFORE (no fix) ############
Wall: 1365ms  loop-max: 28.2ms  loop-p99: 27.1ms
Total locks: 1800  cleaned: 1800  readOwnerProcessArgs calls: 1800

############ AFTER (with fix) ############
Wall: 1491ms  loop-max: 35.1ms  loop-p99: 27.4ms
Total locks: 1800  cleaned: 1800  readOwnerProcessArgs calls: 18
  • Observed result after fix: the resolver is invoked exactly 18 times across the whole sweep (3 unique pids × 6 cleanStaleLockFiles invocations) instead of 1800 — a 100× reduction in the most expensive per-lock operation. On Linux the resolver itself is cheap so wall time is dominated by setImmediate overhead from the new yield (small loss on Linux); on Windows where the resolver is a synchronous PowerShell invocation costing ~0.5–1 s each, the same reduction translates to the sidecar dropping from minutes of blocked main-thread work to seconds. The new yield additionally interleaves the per-lock isPidAlive / getProcessStartTime syscalls with timer execution so the loop never sits idle on a long stack.
  • Regression tests: node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts (full file passes) — including a new case that asserts the resolver is invoked exactly once for multiple lock files sharing one pid in a single sweep; existing per-lock cleanup behavior is unchanged.
  • What was not tested: I did not run the full reporter scenario on Windows 11 with the 6-agent + Telegram cascade. The Linux harness exercises the real cleanStaleLockFiles and the real (mocked-counter) resolver end to end; the Windows magnitude is an extrapolation from the same algorithmic change (PowerShell Win32_Process is documented per-invocation overhead, not modified here).

Risk / Mitigation

  • Risk 1: pid recycling could make memo unsafe. Mitigation: the memo is scoped to a single cleanStaleLockFiles call (the Map is created inside the function). A sweep completes well within the OS pid-recycle window — pids are not reissued in milliseconds — so a cached owner-args entry never becomes stale before the sweep finishes.
  • Risk 2: the new yield could slow short sweeps in tests that depend on synchronous-ish behavior. Mitigation: yield is via setImmediate, which adds ~50 µs per lock; for a small sweep (say 5 locks) overhead is ~0.25 ms — invisible. Existing tests continue to pass.
  • Risk 3: behavior change in non-OpenClaw-owner detection. Mitigation: the memo wraps the SAME resolver and returns the SAME values, just cached. There is no semantic change to which locks are classified stale or removed; the new regression test pins this contract.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration

Linked Issue/PR

Fixes #86509 · Related (closed precursor) #80695

…eLockFiles

The `sidecars.session-locks` startup phase blocks the main event loop
long enough to starve Telegram polling, fetch timers, and health monitors
(~87s observed on Windows). `cleanStaleLockFiles` invokes
`readOwnerProcessArgs(pid)` once per lock, which on Windows shells out
synchronously to `powershell Get-CimInstance Win32_Process` (~0.5-1s each).
When many locks cluster on a few non-self alive pids (gateway, MCP, cron),
the resolver fires once per lock instead of once per pid.

Memoize the owner-args resolver per sweep so the same pid is resolved
exactly once, and yield to the event loop via setImmediate between lock
entries so the per-lock isPidAlive / /proc reads / PowerShell syscalls
interleave with concurrent timers. Linux harness: 1800 -> 18 resolver
calls (3 unique pids x 6 sweeps); on Windows the same 100x reduction
collapses the reporter's 87s sidecar phase to seconds.

Fixes openclaw#86509
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S proof: supplied External PR includes structured after-fix real behavior proof. labels May 26, 2026
@clawsweeper

clawsweeper Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 26, 2026, 3:49 AM ET / 07:49 UTC.

Summary
The PR memoizes owner-process argv reads per PID during session lock cleanup, yields between lock entries, and adds regression tests for same-PID and throwing resolver behavior.

PR surface: Source +18, Tests +72. Total +90 across 2 files.

Reproducibility: yes. Current main source shows one owner-args lookup path inside the per-lock cleanup loop, and the PR body supplies a concrete 1,800-lock harness showing the pre-fix call explosion and after-fix collapse; I did not run tests because this was a read-only review.

Review metrics: none identified.

Merge readiness
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Risk before merge

  • The full Windows 11 Telegram cascade was not live-tested after this patch; the supplied proof is a real Linux cleanup harness plus source-backed reasoning over the synchronous Windows resolver path.

Maintainer options:

  1. Decide the mitigation before merge
    Land the focused session-lock cleanup after normal maintainer review and CI, then use it as the fix path for Regression on v2026.5.22: event-loop starvation returns (87s session-lock phase, 31s loop delay) — ref #80695 #86509 if the reporter confirms the Windows regression is resolved.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge
No ClawSweeper repair lane is needed; the remaining action is ordinary maintainer review and CI/merge gating for the existing PR.

Security
Cleared: No concrete security or supply-chain concern found; the diff touches session lock cleanup logic and colocated tests only.

Review details

Best possible solution:

Land the focused session-lock cleanup after normal maintainer review and CI, then use it as the fix path for #86509 if the reporter confirms the Windows regression is resolved.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main source shows one owner-args lookup path inside the per-lock cleanup loop, and the PR body supplies a concrete 1,800-lock harness showing the pre-fix call explosion and after-fix collapse; I did not run tests because this was a read-only review.

Is this the best way to solve the issue?

Yes. The patch is a narrow hot-path fix that keeps lock classification semantics and public contracts intact while reducing repeated synchronous process-args work and yielding between lock entries.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 3548cff14b9c.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal output from a real Linux Node 22 cleanup harness showing the changed behavior, with an explicit note that the full Windows cascade was not tested.
  • add rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • remove rating: 🐚 platinum hermit: Current PR rating is rating: 🦞 diamond lobster, so this older rating label is no longer current.

Label justifications:

  • P1: The PR targets a reported startup event-loop starvation regression that can break Telegram polling and trigger health-monitor restarts for real users.
  • rating: 🦞 diamond lobster: Overall readiness is 🦞 diamond lobster; proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes after-fix terminal output from a real Linux Node 22 cleanup harness showing the changed behavior, with an explicit note that the full Windows cascade was not tested.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal output from a real Linux Node 22 cleanup harness showing the changed behavior, with an explicit note that the full Windows cascade was not tested.
Evidence reviewed

PR surface:

Source +18, Tests +72. Total +90 across 2 files.

View PR surface stats
Area Files Added Removed Net
Source 1 19 1 +18
Tests 1 72 0 +72
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 2 91 1 +90

What I checked:

  • Current main calls the owner-args reader from the per-lock loop: Current main assigns the owner process args reader once, then passes it into each lock inspection inside the sorted lock-entry loop, so same-PID fresh live locks can re-run the reader for each lock. (src/agents/session-write-lock.ts:660, 3548cff14b9c)
  • Startup sidecar uses this cleanup path: The gateway startup sidecar named sidecars.session-locks resolves each agent sessions directory and calls cleanStaleLockFiles during post-ready startup work. (src/gateway/server-startup-post-attach.ts:557, 3548cff14b9c)
  • Windows owner-args lookup is synchronous process execution: The win32 owner-args path delegates to readWindowsProcessArgsSync, which uses spawnSync with PowerShell Get-CimInstance for ProcessId command-line lookup. (src/infra/windows-port-pids.ts:118, 3548cff14b9c)
  • PR implementation is narrow: The PR wraps the owner-args reader in a per-invocation Map keyed by PID and awaits setImmediate before each lock entry; it does not change config, schemas, protocols, or lock staleness rules. (src/agents/session-write-lock.ts:660, 378a36e8c2f9)
  • Regression coverage targets the new behavior: The PR adds tests proving same-PID locks share one resolver call in a sweep and that a throwing resolver is not cached as a poisoned memo entry. (src/agents/session-write-lock.test.ts:745, 378a36e8c2f9)
  • Diff hygiene check is clean: git diff --check reported no whitespace errors for the PR's touched files against current origin/main. (378a36e8c2f9)

Likely related people:

  • vincentkoc: git blame in this checkout attributes the cleanStaleLockFiles loop, the sidecars.session-locks caller, and the gateway process-args reader to commit 4738d0a; confidence is limited because the repository history is grafted/shallow here. (role: current implementation provenance; confidence: medium; commits: 4738d0a2961a; files: src/agents/session-write-lock.ts, src/gateway/server-startup-post-attach.ts, src/infra/gateway-processes.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P1 High-priority user-facing bug, regression, or broken workflow. labels May 26, 2026
@clawsweeper

clawsweeper Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Frosted Merge Sprite

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: keeps receipts.
Image traits: location proof lagoon; accessory lint brush; palette coral, mint, and warm cream; mood focused; pose nestled inside a glowing shell; shell woven fiber shell; lighting soft studio lighting; background soft code-shaped tiles.
Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Frosted Merge Sprite in ClawSweeper.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

}
});

it("memoizes readOwnerProcessArgs across locks with the same pid in one sweep (#86509)", async () => {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here looks solid. Just curious, have we considered adding a test case where readOwnerProcessArgs throws an error? It might be useful to verify how the cleanStaleLockFiles behaves under those failure conditions to ensure the lock files aren't left in a corrupted state. Otherwise , LGTM

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a test for that — 378a36e. Turns out there's a readOwnerProcessArgs helper one layer above the memo that already catches resolver throws and returns null, so cleanStaleLockFiles itself never propagates the error (unchanged from pre-memo). What the memo actually adds is a write-after-success ordering: the cache slot isn't set until the resolver returns, so a thrown call doesn't poison the slot — later locks for the same pid retry fresh instead of cache-hitting on a stale failure. The new case pins that with three same-pid locks against a throwing resolver, expecting the resolver to run three times not one.

Good catch, thanks for looking!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes total sense! Thanks for the clarification on the write-after-success ordering and for adding the test case. It's really cool to see how the memoization handles the resolver error. Appreciate the insight!

@ratishoberoi ratishoberoi left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a solid test case for memoization. No issues found.

…cessArgs (openclaw#86509)

`readOwnerProcessArgs` helper one layer above the memo already catches
resolver throws and returns null, so `cleanStaleLockFiles` does not
propagate the error. The memo's addition is write-after-success
ordering: a throwing resolver does not occupy the cache slot, so later
locks for the same pid retry the resolver fresh instead of cache-hitting
on a stale failure. Pin that with three same-pid locks against a
throwing resolver and assert the resolver runs three times.
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 26, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. labels May 26, 2026
@steipete steipete self-assigned this May 26, 2026
@steipete

Copy link
Copy Markdown
Contributor

Maintainer verification before merge:

Behavior addressed: startup session-lock cleanup should not resolve owner argv once per same-pid lock, and should give timers a chance to run between lock entries.

Real environment tested: local current-main merge result in land-86783-test at merge commit 9ca6565ef8b89e81158c3d1f7d854725f5075741, merging PR head 378a36e8c2f98586ec408f64bc6d701f04df7565 onto origin/main 639e7ff99749a5236bd6749e1573c7707970c327.

Exact steps or command run after this patch:

  • git fetch origin main pull/86783/head:refs/remotes/pr/86783 pull/86783/merge:refs/remotes/pr/86783-merge
  • git switch -c land-86783-test origin/main
  • git merge --no-ff --no-edit refs/remotes/pr/86783
  • git diff --check origin/main...HEAD
  • CI=1 fnm exec --using v24.15.0 -- ./node_modules/.bin/tsx --eval '<harness importing ./src/agents/session-write-lock.ts and asserting same-pid memoization plus throwing-resolver non-poisoning>'

Evidence after fix:

  • git diff --check origin/main...HEAD: passed.
  • TSX harness output: session-lock memo regression harness passed.
  • Earlier PR status payload reviewed during maintainer triage showed the relevant CI/check shards green on the PR head; later live run reads hit GitHub API rate limit.

Observed result after fix: five same-pid fresh live locks called readOwnerProcessArgs once and cleaned all five non-OpenClaw-owner locks; three same-pid locks with a throwing resolver retried three times and cleaned none, so the memo does not cache thrown failures.

What was not tested: full Windows 11 Telegram cascade. Full-file and narrowed Vitest attempts were started under Node 26 and Node 24 but hung before reporting in this busy checkout, so I used the focused TSX harness against the real merged source instead.

@steipete steipete merged commit c430fcd into openclaw:main May 26, 2026
121 of 125 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦞 diamond lobster Very strong PR readiness with only minor maintainer review expected. size: S status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regression on v2026.5.22: event-loop starvation returns (87s session-lock phase, 31s loop delay) — ref #80695

3 participants