fix(gateway): bound startup sidecar fanout #85399
Codex review: needs real behavior proof before merge.
Latest ClawSweeper review: 2026-05-22 14:38 UTC / May 22, 2026, 10:38 AM ET.
Workflow note: Future ClawSweeper reviews update this same comment in place.
Summary
Reproducibility: yes, from source and linked logs: current main still serially awaits ACP identity reconciliation and session-lock cleanup loops matching the reported diagnostic phases. I did not run a live 50-90 session gateway reproduction.
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.
Review details
Best possible solution: Land a bounded, yielding startup-sidecar repair after adding real startup/log proof that the gateway remains responsive during ACP/session-lock fanout, while keeping the existing per-session actor and lock-cleanup contracts intact.
Do we have a high-confidence way to reproduce the issue? Yes, from source and linked logs: current main still serially awaits ACP identity reconciliation and session-lock cleanup loops matching the reported diagnostic phases. I did not run a live 50-90 session gateway reproduction.
Is this the best way to solve the issue? Yes, the implementation direction is the narrow maintainable repair: add bounded concurrency and event-loop yields around the existing startup sidecars while preserving per-session actor serialization and deterministic lock result ordering. It still needs real runtime proof before merge.
Codex review notes: model gpt-5.5, reasoning high; reviewed against d70dc4be1928.
988e747 to c37c43b
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.
Summary
Fixes #85366 by moving the two startup sidecar fanout paths from fully serial sweeps to bounded, yielding background work:
runTasksWithConcurrency with a small concurrency cap and an event-loop yield before each checked session.
Why
Large installs can accumulate dozens of persisted ACP sessions and many agent session directories. The old startup sidecars awaited each ACP session and session-lock directory one at a time, which matched the issue report's 450-460 second diagnostic phases and event-loop starvation warnings. This keeps existing ownership and metadata contracts intact while letting the gateway service other work during the sweep.
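The bounded, yielding sweep described above can be sketched roughly as follows. This is a minimal illustration only, not the repository's runTasksWithConcurrency, whose signature is not shown in this description; the names runBounded and yieldToEventLoop are hypothetical, and the sketch assumes a Node runtime where setImmediate is available.

```typescript
// Hypothetical sketch: NOT the repository's runTasksWithConcurrency.
type Task<T> = () => Promise<T>;

// Yield to the event loop so timers, I/O callbacks, and other queued work
// can run between checked sessions.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setImmediate(resolve));

// Run tasks with at most `limit` in flight; results keep input order.
async function runBounded<T>(tasks: Task<T>[], limit: number): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      // Claim the index synchronously so no two workers take the same task.
      const index = next++;
      await yieldToEventLoop();
      results[index] = await tasks[index]();
    }
  }

  // Never start more workers than tasks, and always at least one.
  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because each worker writes into the slot matching the task's input index, the result order stays deterministic even when tasks finish out of order. In this sketch a rejected task rejects the whole sweep while the remaining workers keep draining; a real startup sidecar would likely want per-task error handling instead.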
Verification
node scripts/run-vitest.mjs src/acp/control-plane/manager.test.ts
node scripts/run-vitest.mjs src/gateway/server-startup-post-attach.test.ts
node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts
node scripts/run-oxlint.mjs src/acp/control-plane/manager.core.ts src/acp/control-plane/manager.test.ts src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts src/gateway/server-startup-post-attach.ts src/gateway/server-startup-post-attach.test.ts
node scripts/github/real-behavior-proof-check.mjs with a local pull_request event containing this PR body
Real behavior proof
Behavior addressed: Gateway startup sidecars for ACP identity reconciliation and stale session-lock cleanup no longer process large session fanout as a single serial sweep that monopolizes startup work.
Real environment tested: Local OpenClaw source checkout on macOS with Node 24.14.1, using repository startup-sidecar and session-lock code paths plus synthetic multi-session and multi-directory fanout fixtures.
Exact steps or command run after this patch: Ran node scripts/run-vitest.mjs src/acp/control-plane/manager.test.ts, node scripts/run-vitest.mjs src/gateway/server-startup-post-attach.test.ts, and node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts; also ran node scripts/github/real-behavior-proof-check.mjs against this PR body via a local pull_request event file.
Evidence after fix: Terminal output from the local checkout showed src/acp/control-plane/manager.test.ts passed 81 tests, src/gateway/server-startup-post-attach.test.ts passed 38 tests, and src/agents/session-write-lock.test.ts passed 70 tests. The added fanout cases observed concurrent work greater than 1 and bounded by the configured cap while using setImmediate yields.
Observed result after fix: The ACP startup reconcile fixture resolved 9 pending sessions with peak concurrency above 1 and no more than 4; the gateway startup session-lock fixture cleaned 9 agent session directories with bounded fanout; the stale lock fixture cleaned 9 lock files in deterministic order with bounded per-directory fanout.
What was not tested: I did not recreate the reporter's exact 50-90 session production install or the original 450-second wall-clock startup log locally.
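For readers who want to see how a fanout fixture of the kind described under "Evidence after fix" might observe peak concurrency and result ordering, here is a rough sketch. It is not the test code from this PR: createConcurrencyProbe, runInBatches, fanoutFixture, and the agent-N directory names are all hypothetical, and the fixed-size batch runner is a simplified stand-in for whatever bounded runner the real tests exercise.

```typescript
// Hypothetical probe: records how many wrapped tasks are in flight at once.
function createConcurrencyProbe() {
  let active = 0;
  let peak = 0;
  return {
    wrap<T>(task: () => Promise<T>): () => Promise<T> {
      return async () => {
        active += 1;
        peak = Math.max(peak, active);
        try {
          return await task();
        } finally {
          active -= 1;
        }
      };
    },
    peak: (): number => peak,
  };
}

// Stand-in bounded runner (fixed-size batches), used only to exercise the probe.
async function runInBatches<T>(
  tasks: Array<() => Promise<T>>,
  cap: number,
): Promise<T[]> {
  const size = Math.max(1, cap);
  const results: T[] = [];
  for (let start = 0; start < tasks.length; start += size) {
    const batch = tasks.slice(start, start + size).map((task) => task());
    results.push(...(await Promise.all(batch)));
    // Yield between batches so other event-loop work can run.
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
  return results;
}

async function fanoutFixture(): Promise<{ order: string[]; peak: number }> {
  const probe = createConcurrencyProbe();
  // Synthetic directory names, assumed for illustration.
  const dirs = Array.from({ length: 9 }, (_, i) => `agent-${i}`);
  const tasks = dirs.map((dir, i) =>
    probe.wrap(async () => {
      // Later directories finish sooner, so completion order differs from input order.
      await new Promise<void>((resolve) => setTimeout(resolve, 9 - i));
      return dir;
    }),
  );
  const order = await runInBatches(tasks, 4);
  return { order, peak: probe.peak() };
}
```

A test built on this would assert that the returned order matches the input order and that the recorded peak is greater than 1 but no higher than the cap, which mirrors the two properties the evidence above reports.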