Summary
Gateway startup measures sidecars.acp.identity-reconcile at >450 seconds (7.5 minutes) and sidecars.session-locks at >463 seconds (7.7 minutes) on installs with a large number of persisted ACP sessions. During this window the diagnostic emits liveness warning: ... eventLoopDelayP99Ms=360240 eventLoopDelayMaxMs=360240 eventLoopUtilization=1 — the gateway event loop is effectively non-responsive for ~6 minutes. New prompts, channel events, and active session work all queue behind this serial startup work.
The proposed direction: bound startup-phase work with explicit concurrency caps, move per-session reconcile off the awaited startup path (background worker / queued), and add an SLO on sidecars.acp.identity-reconcile and sidecars.session-locks so wall-time growth is alerted before it saturates the event loop.
Environment
- OpenClaw 2026.5.20 (e510042), npm install at ~/.local/lib/node_modules/openclaw
- Node 25.8.1, macOS 25.3.0 (arm64)
- Install with ~90 persisted ACP sessions across main/ops/ashley/indexer/chatgpt/chief-of-staff/agents-orchestrator agents
- ACP backend: acpx; runtime: codex + claude
Reproduction
Approximate (full reproduction needs the same session fanout as our install):
- Install OpenClaw on a host with acp.enabled: true and accumulate many ACP sessions across multiple agents (~50+).
- Restart the gateway (launchctl kickstart -k gui/$(id -u)/ai.openclaw.gateway).
- Within minutes of startup, observe liveness warning events with recentPhases=sidecars.acp.identity-reconcile:NNN ms,…sidecars.session-locks:MMM ms, where NNN and MMM are hundreds of seconds expressed in milliseconds and eventLoopDelayP99Ms ≥ 30 s.
Error / log evidence
From /Users/agent/.openclaw/logs/gateway.log.20260522-180005, 2026-05-22 15:21:41 ICT:
liveness warning: reasons=event_loop_delay,event_loop_utilization
interval=364s eventLoopDelayP99Ms=360240.4 eventLoopDelayMaxMs=360240.4
eventLoopUtilization=1 cpuCoreRatio=0.71
active=1 waiting=0 queued=0
recentPhases=sidecars.acp.identity-reconcile:450139ms,
channels.discord.is-configured:0ms,
channels.discord.runtime:0ms,
channels.discord.approval-bootstrap:0ms,
channels.discord.start-account-handoff:4ms,
sidecars.session-locks:463310ms
work=[active=agent:ops:discord:channel:...(processing/tool_call,q=1,age=361s last=tool:bash:started)]
The diagnostic-phase tracker (diagnostic-phase-D1Ieo0f0.js) records these as wall-clock durations of withDiagnosticPhase / measureStartup calls, so the IIFE genuinely took 450 s to settle. The matching summary line, acp startup identity reconcile (renderer=...): checked=0 resolved=0 failed=0, is absent (it is logged only if checked > 0), so the 450 s is not accounted for by completed per-session checks. It is almost certainly time spent waiting in the chained dynamic import(), getAcpSessionManager().reconcilePendingSessionIdentities(), and the in-loop withSessionActor() awaits, all blocked behind the saturated event loop while other work runs.
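For triage, the per-phase durations can be pulled out of a liveness warning mechanically. The sketch below assumes the recentPhases field uses the comma-separated name:NNNms layout seen in the excerpt above; parseRecentPhases and slowPhases are hypothetical helper names for illustration, not part of OpenClaw.

```typescript
// Sketch only: the field layout is inferred from the log excerpt above,
// and both helper names are hypothetical.
interface PhaseDuration {
  phase: string;
  ms: number;
}

// Extract every `<phase>:<N>ms` entry that follows `recentPhases=`.
function parseRecentPhases(logText: string): PhaseDuration[] {
  const marker = "recentPhases=";
  const start = logText.indexOf(marker);
  if (start === -1) return [];
  const tail = logText.slice(start + marker.length);
  // Phase names contain letters, digits, dots, underscores, and hyphens.
  const entry = /([A-Za-z0-9_.-]+):(\d+)\s*ms\b/g;
  const phases: PhaseDuration[] = [];
  let match: RegExpExecArray | null;
  while ((match = entry.exec(tail)) !== null) {
    phases.push({ phase: match[1], ms: Number(match[2]) });
  }
  return phases;
}

// Keep only the phases that ran longer than the given threshold.
function slowPhases(phases: PhaseDuration[], thresholdMs: number): PhaseDuration[] {
  return phases.filter((p) => p.ms > thresholdMs);
}
```

Applied to the warning above with a 30000 ms threshold, this would flag only sidecars.acp.identity-reconcile and sidecars.session-locks; the channels.discord.* phases all finish in a few milliseconds.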
Root cause
src/gateway/server-startup-post-attach.ts:520-540:
if (params.cfg.acp?.enabled) {
  void (async () => {
    await waitForAcpRuntimeBackendReady({ backendId: params.cfg.acp?.backend });
    const [{ getAcpSessionManager }, { ACP_SESSION_IDENTITY_RENDERER_VERSION }] =
      await Promise.all([
        import("../acp/control-plane/manager.js"),
        import("../acp/runtime/session-identifiers.js"),
      ]);
    const result = await getAcpSessionManager().reconcilePendingSessionIdentities({
      cfg: params.cfg,
    });
    // …
  })().catch(/* swallow */);
}
reconcilePendingSessionIdentities (src/acp/control-plane/manager.core.ts:230-296) is a sequential for ... of acpSessions loop, each iteration await-ing withSessionActor() → ensureRuntimeHandle() → reconcileRuntimeSessionIdentifiers() → writeSessionMeta():
for (const session of acpSessions) {
  // …
  await this.withSessionActor(session.sessionKey, async () => {
    const { runtime, handle, meta } = await this.ensureRuntimeHandle({…});
    const reconciled = await this.reconcileRuntimeSessionIdentifiers({…});
    // …
  });
}
Each iteration can do runtime status fetch + disk I/O. With ~50-90 sessions and a contended event loop (other agents actively running tools at the same time), the wall-clock blows out to multiple minutes.
sidecars.session-locks (server-startup-post-attach.ts:550-584) is similarly a sequential for (const sessionsDir of sessionDirs) { await cleanStaleLockFiles({…}) } over all per-agent session directories. Same shape; same growth profile.
Suggested fix
- Bound concurrency in reconcilePendingSessionIdentities with an explicit limit (e.g. 4-8 parallel withSessionActor calls). Wall-clock time still grows linearly with session count, but is divided by up to the concurrency limit when the per-session work is I/O-bound:
const CONCURRENCY = 8;
const queue = [...acpSessions];
await Promise.all(
  Array.from({ length: CONCURRENCY }, async () => {
    while (queue.length) {
      const session = queue.shift();
      if (!session) return;
      // existing per-session work here
    }
  }),
);
- Yield to the event loop between iterations even in sequential mode, for example with await setImmediate() or the experimental await scheduler.yield(), both from node:timers/promises. This lets other work (Discord events, in-flight tool calls) progress during the reconcile sweep.
- Move to a true background worker thread for the disk-scan + runtime-status portions, communicating back via the existing diagnostic event bus. The control-plane state mutation still has to land on the main loop, but the I/O does not.
- Add an SLO/warning when sidecars.acp.identity-reconcile or sidecars.session-locks durations exceed e.g. 30 s, with the count of sessions inspected attached. A warning that fires before saturation is more useful than a liveness warning that fires during saturation.
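The concurrency cap, the between-iteration yield, and the duration warning can live in one helper shared by both startup loops. What follows is a minimal sketch under stated assumptions, not the OpenClaw implementation: runBoundedSweep, its options, and the SweepResult shape are hypothetical names, and counting a thrown worker as "failed" rather than aborting the sweep is an assumption about the desired behavior.

```typescript
import { setImmediate as yieldToEventLoop } from "node:timers/promises";

// Hypothetical result shape; not an existing OpenClaw type.
interface SweepResult {
  processed: number;
  failed: number;
  elapsedMs: number;
  slow: boolean;
}

// Run `worker` over `items` with at most `concurrency` calls in flight,
// yielding to the event loop after every item so other work can progress.
async function runBoundedSweep<T>(
  items: readonly T[],
  worker: (item: T) => Promise<void>,
  options: { concurrency: number; warnAfterMs: number; label: string },
): Promise<SweepResult> {
  const startedAt = Date.now();
  let next = 0;
  let processed = 0;
  let failed = 0;

  // Each lane claims the next unprocessed index until none remain.
  const lane = async (): Promise<void> => {
    while (next < items.length) {
      const item = items[next++];
      try {
        await worker(item);
        processed++;
      } catch {
        failed++; // assumption: one bad item should not abort the sweep
      }
      await yieldToEventLoop();
    }
  };

  const laneCount = Math.max(1, Math.min(options.concurrency, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));

  const elapsedMs = Date.now() - startedAt;
  const slow = elapsedMs > options.warnAfterMs;
  if (slow) {
    // Attach the item count so growth can be correlated with session fanout.
    console.warn(
      `${options.label} took ${elapsedMs}ms for ${items.length} items ` +
        `(threshold ${options.warnAfterMs}ms)`,
    );
  }
  return { processed, failed, elapsedMs, slow };
}
```

In the reconcile loop the worker would wrap the existing per-session withSessionActor body; in the session-locks loop it would wrap cleanStaleLockFiles for one sessions directory. The cap only shortens wall-clock time when the per-item work is dominated by I/O waits; if the event loop itself is the bottleneck, the yield and the worker-thread option matter more than the lane count.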
Workaround
- Prune terminal ACP session metadata so reconcilePendingSessionIdentities has less to inspect (related: #82414, #72013).
- Restart the gateway at low-traffic hours so the 7-minute event-loop bubble doesn't queue user prompts.
Severity
P2 — gateway is functionally unresponsive for several minutes after restart on installs with normal accumulated session history. Stacks badly with #84076/#82640-class incidents: a wedged bash tool call from one session sits in recovery=none while the event loop is busy reconciling identities for unrelated sessions. Also surfaces as "going dark" / "did not respond" complaints from users typing during the window.
Related
#72013 — ACP startup identity reconcile warns on terminal one-shot sessions (open; about noise, not about wall time)
#82414 — reconcilePendingSessionIdentities counts vanished-backer sessions as "failed" indefinitely; no prune path (closed)
#40566 — ACP startup: identity reconcile runs before acpx backend ready (closed)
#73655 — Gateway leak triad on plugin restart: Manifest EADDRINUSE retry loop, signal-handler accumulation, sync I/O on session JSONL → WS handshake starvation (closed; same family of event-loop-starvation root causes)
#78402 — Gateway repeatedly closes connections (1000/1005/1006) due to event-loop starvation caused by stuck tool call (closed; symptomatic, different root)