Skip to content

fix(gateway): bound startup sidecar fanout#85399

Open
NianJiuZst wants to merge 1 commit into
openclaw:mainfrom
NianJiuZst:codex/fix-85366-startup-sidecars
Open

fix(gateway): bound startup sidecar fanout#85399
NianJiuZst wants to merge 1 commit into
openclaw:mainfrom
NianJiuZst:codex/fix-85366-startup-sidecars

Conversation

@NianJiuZst

Copy link
Copy Markdown
Contributor

Summary

Fixes #85366 by moving the two startup sidecar fanout paths from fully serial sweeps to bounded, yielding background work:

  • ACP startup identity reconciliation now runs pending session work through runTasksWithConcurrency with a small concurrency cap and an event-loop yield before each checked session.
  • Startup stale session-lock cleanup now scans agent session directories through a bounded fanout and preserves restart-aborted marking for cleaned locks.
  • Per-directory stale lock file inspection now uses bounded concurrency, yields before each lock inspection, and still propagates cleanup errors instead of silently dropping them.

Why

Large installs can accumulate dozens of persisted ACP sessions and many agent session directories. The old startup sidecars awaited each ACP session and session-lock directory one at a time, which matched the issue report's 450-460 second diagnostic phases and event-loop starvation warnings. This keeps existing ownership and metadata contracts intact while letting the gateway service other work during the sweep.

Verification

  • node scripts/run-vitest.mjs src/acp/control-plane/manager.test.ts
  • node scripts/run-vitest.mjs src/gateway/server-startup-post-attach.test.ts
  • node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts
  • node scripts/run-oxlint.mjs src/acp/control-plane/manager.core.ts src/acp/control-plane/manager.test.ts src/agents/session-write-lock.ts src/agents/session-write-lock.test.ts src/gateway/server-startup-post-attach.ts src/gateway/server-startup-post-attach.test.ts
  • node scripts/github/real-behavior-proof-check.mjs with a local pull_request event containing this PR body

Real behavior proof

Behavior addressed: Gateway startup sidecars for ACP identity reconciliation and stale session-lock cleanup no longer process large session fanout as a single serial sweep that monopolizes startup work.

Real environment tested: Local OpenClaw source checkout on macOS with Node 24.14.1, using repository startup-sidecar and session-lock code paths plus synthetic multi-session and multi-directory fanout fixtures.

Exact steps or command run after this patch: Ran node scripts/run-vitest.mjs src/acp/control-plane/manager.test.ts, node scripts/run-vitest.mjs src/gateway/server-startup-post-attach.test.ts, and node scripts/run-vitest.mjs src/agents/session-write-lock.test.ts; also ran node scripts/github/real-behavior-proof-check.mjs against this PR body via a local pull_request event file.

Evidence after fix: Terminal output from the local checkout showed src/acp/control-plane/manager.test.ts passed 81 tests, src/gateway/server-startup-post-attach.test.ts passed 38 tests, and src/agents/session-write-lock.test.ts passed 70 tests. The added fanout cases observed concurrent work greater than 1 and bounded by the configured cap while using setImmediate yields.

Observed result after fix: The ACP startup reconcile fixture resolved 9 pending sessions with peak concurrency above 1 and no more than 4; the gateway startup session-lock fixture cleaned 9 agent session directories with bounded fanout; the stale lock fixture cleaned 9 lock files in deterministic order with bounded per-directory fanout.

What was not tested: I did not recreate the reporter's exact 50-90 session production install or the original 450-second wall-clock startup log locally.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: L proof: supplied External PR includes structured after-fix real behavior proof. labels May 22, 2026
@clawsweeper

clawsweeper Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-22 14:38 UTC / May 22, 2026, 10:38 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR bounds ACP startup identity reconciliation, stale session-lock directory cleanup, and per-directory lock inspection with small concurrency caps and event-loop yields, with focused regression tests and a changelog entry.

Reproducibility: yes. from source and linked logs: current main still serially awaits ACP identity reconciliation and session-lock cleanup loops matching the reported diagnostic phases. I did not run a live 50-90 session gateway reproduction.

PR rating
Overall: 🦐 gold shrimp
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Summary: The patch is focused and well-covered by synthetic tests, but mock-only proof keeps the overall rating below merge-ready.

Rank-up moves:

  • Add redacted after-fix startup/log or terminal proof from a real OpenClaw gateway run outside Vitest, preferably showing ACP/session-lock fanout no longer monopolizes startup work.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: Needs real behavior proof before merge: the PR body only lists tests/mocks and synthetic fixture output, so the contributor should add redacted terminal logs, copied live output, or a recording from an after-fix OpenClaw gateway startup run; after updating the PR body, ClawSweeper should re-review automatically or a maintainer can comment @clawsweeper re-review.

Risk before merge

  • The PR body only proves the fix through Vitest/oxlint and synthetic fixtures, so the external-PR real behavior gate still needs redacted startup/log or terminal output from a real OpenClaw run outside the test harness.
  • The reporter's exact 50-90 session production install and original 450-second wall-clock startup window were not recreated, so live fanout improvement remains unproven even though the source-level repair is plausible.

Maintainer options:

  1. Decide the mitigation before merge
    Land a bounded, yielding startup-sidecar repair after adding real startup/log proof that the gateway remains responsive during ACP/session-lock fanout, while keeping the existing per-session actor and lock-cleanup contracts intact.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge
Contributor action is needed for non-mock after-fix real behavior proof; I did not identify a narrow code repair for ClawSweeper to queue.

Security
Cleared: The diff does not introduce a concrete security or supply-chain concern; it changes bounded runtime fanout plus tests and changelog only.

Review details

Best possible solution:

Land a bounded, yielding startup-sidecar repair after adding real startup/log proof that the gateway remains responsive during ACP/session-lock fanout, while keeping the existing per-session actor and lock-cleanup contracts intact.

Do we have a high-confidence way to reproduce the issue?

Yes from source and linked logs: current main still serially awaits ACP identity reconciliation and session-lock cleanup loops matching the reported diagnostic phases. I did not run a live 50-90 session gateway reproduction.

Is this the best way to solve the issue?

Yes, the implementation direction is the narrow maintainable repair: add bounded concurrency and event-loop yields around the existing startup sidecars while preserving per-session actor serialization and deterministic lock result ordering. It still needs real runtime proof before merge.

Label changes:

  • add P2: The linked bug is a normal-priority gateway startup responsiveness issue with meaningful impact on large ACP installs but a bounded startup-window blast radius.
  • add rating: 🦐 gold shrimp: Current PR rating is 🦐 gold shrimp because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch is focused and well-covered by synthetic tests, but mock-only proof keeps the overall rating below merge-ready.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: Needs real behavior proof before merge: the PR body only lists tests/mocks and synthetic fixture output, so the contributor should add redacted terminal logs, copied live output, or a recording from an after-fix OpenClaw gateway startup run; after updating the PR body, ClawSweeper should re-review automatically or a maintainer can comment @clawsweeper re-review.

Label justifications:

  • P2: The linked bug is a normal-priority gateway startup responsiveness issue with meaningful impact on large ACP installs but a bounded startup-window blast radius.
  • rating: 🦐 gold shrimp: Current PR rating is 🦐 gold shrimp because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch is focused and well-covered by synthetic tests, but mock-only proof keeps the overall rating below merge-ready.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: Needs real behavior proof before merge: the PR body only lists tests/mocks and synthetic fixture output, so the contributor should add redacted terminal logs, copied live output, or a recording from an after-fix OpenClaw gateway startup run; after updating the PR body, ClawSweeper should re-review automatically or a maintainer can comment @clawsweeper re-review.

What I checked:

  • Current ACP startup path is serial on main: Current main still has reconcilePendingSessionIdentities iterate ACP sessions with a serial for ... of loop that awaits withSessionActor, runtime handle resolution, identifier reconciliation, and metadata writes before advancing. (src/acp/control-plane/manager.core.ts:259, d70dc4be1928)
  • Current session-lock startup path is serial on main: Current main resolves agent session directories and awaits cleanStaleLockFiles for each directory one at a time; cleanStaleLockFiles also reads/removes each lock entry in a serial loop. (src/gateway/server-startup-post-attach.ts:653, d70dc4be1928)
  • PR diff applies bounded fanout at the implicated paths: The PR imports runTasksWithConcurrency, adds cap constants of 4, yields with setImmediate, and applies bounded fanout to ACP reconcile, startup session directory cleanup, and per-directory lock inspection while preserving error propagation. (src/gateway/server-startup-post-attach.ts:648, c37c43b13f2f)
  • PR adds focused fanout regression coverage: The diff adds synthetic fanout tests for ACP reconciliation, gateway startup session directory cleanup, and stale lock cleanup, asserting concurrency greater than one and no greater than the configured cap. (src/acp/control-plane/manager.test.ts:3916, c37c43b13f2f)
  • Linked issue has source-level reproduction context: The linked issue reports OpenClaw 2026.5.20 startup diagnostics with sidecars.acp.identity-reconcile and sidecars.session-locks each taking roughly 450-460 seconds, and its ClawSweeper comment already kept it open as source-reproducible against the same serial loops.
  • Real behavior proof is mock/test-only: The PR body's after-fix proof lists Vitest wrapper runs, oxlint, and synthetic fixture observations; it does not include redacted terminal logs, live output, or another artifact from a real OpenClaw gateway startup outside the test harness. (c37c43b13f2f)

Likely related people:

  • osolmaz: The ACP thread-bound agents merge introduced the ACP control-plane manager and startup reconciliation area that this PR changes. (role: ACP feature owner; confidence: medium; commits: a7d56e3554d0; files: src/acp/control-plane/manager.core.ts)
  • Grynn: The stale session lock cleanup/watchdog work appears in the history as the origin of the cleanStaleLockFiles behavior this PR parallelizes. (role: stale-lock cleanup introducer; confidence: medium; commits: e91a5b021648; files: src/agents/session-write-lock.ts, src/gateway/server-startup-post-attach.ts)
  • gumadeiras: The gateway startup/runtime seam split is the nearest major history point for startGatewaySidecars and post-ready sidecar scheduling. (role: gateway startup seam refactor author; confidence: medium; commits: 8de63ca26825; files: src/gateway/server-startup-post-attach.ts)
  • steipete: Peter Steinberger committed the stale-lock cleanup/watchdog change that introduced the lock-cleanup startup behavior in current history. (role: session-lock cleanup committer; confidence: medium; commits: e91a5b021648; files: src/agents/session-write-lock.ts, src/gateway/server-startup-post-attach.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against d70dc4be1928.

@NianJiuZst NianJiuZst force-pushed the codex/fix-85366-startup-sidecars branch from 988e747 to c37c43b Compare May 22, 2026 14:31
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal backlog priority with limited blast radius. labels May 22, 2026
@clawsweeper

clawsweeper Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@BingqingLyu

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling gateway Gateway runtime P2 Normal backlog priority with limited blast radius. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. size: L status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ACP startup sidecars saturate event loop on installs with many sessions — identity-reconcile + session-locks 450-460 s wall, eventLoopDelayP99 6 min

2 participants