60s startup hang in sidecars.channels — synchronous plugin manifest re-discovery on every cold start (v2026.4.26) #73353

@chsbusch-dot

Environment

  • OpenClaw: 2026.4.26 (be8c246)
  • Node: v24.15.0
  • OS: Ubuntu 24.04 (6.8.0-110-generic, x86_64)
  • Deployment: openclaw-gateway.service via systemd user unit

Symptom

openclaw gateway start takes ~67s from systemd start to [gateway] ready. About 55-60s of that is a silent window with no log output, and the app is unusable until it ends.

Reproduction

Any cold start of the gateway with channels configured:

time systemctl --user restart openclaw-gateway
# watch journalctl; ~55s gap between "[hooks] loaded" and "[gateway] ready"

Config that triggers it: channels.telegram.enabled = true (though telegram is not the root cause — see bisect below).

Instrumentation

Built-in startup trace (OPENCLAW_GATEWAY_STARTUP_TRACE=1)

| Stage | Duration | eventLoopMax |
|---|---|---|
| plugins.bootstrap | 2721ms | |
| sidecars.session-locks | 4.5ms | 0ms |
| sidecars.gmail-watch | 0.1ms | 0ms |
| sidecars.gmail-model | 0.2ms | 0ms |
| sidecars.internal-hooks | 1882ms | 36ms |
| sidecars.channels | 54 829ms | 22 029ms |
| sidecars.plugin-services | 379ms | 372ms |
| sidecars.memory | 0.1ms | 0ms |
| sidecars.total | 57 128ms | |
| ready | 1.3ms | 0ms |

eventLoopMax = 22 029ms means the JS event loop was synchronously blocked for 22 seconds at one point — not a network timeout.
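To illustrate why a large eventLoopMax points at synchronous work rather than a slow network call, here is a minimal sketch that measures how long a zero-delay timer is held back while synchronous code runs. The helper name measureLagDuringSyncWork and the busy-wait are assumptions made for illustration; this is not how OpenClaw's startup trace is implemented. Node also ships perf_hooks.monitorEventLoopDelay for continuous sampling of the same effect.

```javascript
// Hypothetical helper, not part of OpenClaw: reports how long a zero-delay
// timer had to wait while synchronous work held the event loop.
function measureLagDuringSyncWork(blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = performance.now();
    // This callback cannot run until the synchronous code below returns.
    setTimeout(() => resolve(performance.now() - scheduledAt), 0);

    // Busy-wait as a stand-in for a synchronous manifest walk.
    const start = performance.now();
    while (performance.now() - start < blockMs) {
      // spin
    }
  });
}

measureLagDuringSyncWork(50).then((lagMs) => {
  console.log(`timer fired after ${lagMs.toFixed(1)}ms`);
});
```

A pending network request would leave the loop free to run the timer on time; only synchronous work delays it, which is the pattern the 22 029ms reading suggests.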

Bisect

| Run | sidecars.channels | eventLoopMax |
|---|---|---|
| baseline | 54 829ms | 22 029ms |
| OPENCLAW_SKIP_CHANNELS=1 | 1.9ms | 0ms |
| OPENCLAW_TELEGRAM_DISABLE_AUTO_SELECT_FAMILY=1 + OPENCLAW_TELEGRAM_DNS_RESULT_ORDER=ipv4first | 54 456ms | 22 029ms |
| channels.telegram.enabled = false (in config) | 54 406ms | 21 760ms |

Telegram is not the cause. Neither disabling telegram nor enabling the DNS hardening flags changes the timing beyond noise (under 1%). Skipping the entire channels block (OPENCLAW_SKIP_CHANNELS=1) eliminates the hang.

V8 CPU profile (node --prof)

Top JavaScript hot frames (ticks = share of 74s profiling window):

2877 ticks (4.3%)  json5/lib/parse.js *parse
2670 ticks (4.0%)  json5/lib/parse.js *beforePropertyValue
2328 ticks (3.5%)  json5/lib/parse.js *string

Bottom-up callchain through the hot json5 frames:

loadPluginManifest               dist/manifest-DkU_xlZi.js:1166
  ← loadPluginManifestRegistry   dist/manifest-registry-CXpW6f0a.js:341  (57.7%)
    ← discoverInDirectory        dist/discovery-CRcfnviq.js:481
      ← loadOpenClawPlugins      dist/loader--FR-1ZCZ.js:2903

Also significant:

  • collectRuntimePackageWildcardImportTargets / isPathInside / boundary-path — synchronous path resolution inside the discovery loop
  • 2275 ticks in node:path resolve driven by boundary checks

Top C++ (syscall view)

| Syscall | Ticks | % of C++ |
|---|---|---|
| syscall | 5639 | 12.0% |
| __open | 2189 | 4.7% |
| access | 2126 | 4.5% |
| __read | 1805 | 3.8% |
| getdents64 | 198 | 0.4% |

Heavy synchronous filesystem walk — opening, statting, and reading many files on the critical path.

Root Cause Hypothesis

sidecars.channels calls prewarmConfiguredPrimaryModel before startChannels(). prewarmConfiguredPrimaryModel calls ensureOpenClawModelsJson → getCurrentPluginMetadataSnapshot, which triggers a full plugin manifest discovery walk (the same work plugins.bootstrap already did in ~2.7s earlier in the same startup). Discovery synchronously opens every plugin's package.json/manifest, json5-parses it, and canonicalizes paths, blocking the event loop for up to ~22s at a stretch and taking ~55s of wall time.
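A repeated walk like this is the kind of cost a process-wide memo avoids. The following is a minimal sketch under stated assumptions: discoverPluginManifests, getPluginSnapshot, invalidatePluginSnapshot, and the placeholder plugin names are all hypothetical and do not reflect OpenClaw's actual internals or the real getCurrentPluginMetadataSnapshot.

```javascript
// Hypothetical stand-in for the expensive discovery walk; it counts its calls
// and returns placeholder data instead of reading any files.
let discoveryRuns = 0;
function discoverPluginManifests() {
  discoveryRuns += 1;
  return { plugins: ["plugin-a", "plugin-b"], discoveredAt: Date.now() };
}

// Process-wide memo: the first caller pays for discovery, later callers reuse it.
let cachedSnapshot = null;
function getPluginSnapshot() {
  if (cachedSnapshot === null) {
    cachedSnapshot = discoverPluginManifests();
  }
  return cachedSnapshot;
}

// Explicit invalidation for cases where the plugin set may have changed.
function invalidatePluginSnapshot() {
  cachedSnapshot = null;
}

const atBootstrap = getPluginSnapshot(); // e.g. during plugins.bootstrap
const atPrewarm = getPluginSnapshot();   // e.g. during the model prewarm
console.log(discoveryRuns, atBootstrap === atPrewarm); // prints: 1 true
```

Whether a single snapshot is safe to share depends on whether plugins can change between bootstrap and prewarm, which is why the sketch keeps an explicit invalidation hook.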

The prewarm also runs when the primary model (google/gemini-3.1-flash-lite-preview) goes through a non-pi harness. The three early-exit guards (isConfiguredCliBackendPrimary, isCliProvider, selectAgentHarness().id !== "pi") are checked only after the 7-module Promise.all import and the discovery-triggering ensureOpenClawModelsJson call, so non-pi models still pay the full cost.
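The ordering problem can be sketched as follows. Every name here (needsPrewarm, loadModulesAndDiscover, prewarmPrimaryModel, the config fields) is a hypothetical stand-in, and the sketch assumes the three guards can be evaluated from already-loaded config without the imports they currently follow; that may not hold in the real code.

```javascript
// All names below are hypothetical stand-ins, not OpenClaw's real functions.
let expensiveCalls = 0;

async function loadModulesAndDiscover() {
  // Stands in for the Promise.all import block plus manifest discovery.
  expensiveCalls += 1;
  return { ready: true };
}

function needsPrewarm(config) {
  // Cheap, synchronous checks that only read already-loaded config.
  if (config.isCliBackendPrimary) return false;
  if (config.isCliProvider) return false;
  if (config.harnessId !== "pi") return false;
  return true;
}

async function prewarmPrimaryModel(config) {
  // Guards first: bail out before any import or discovery work starts.
  if (!needsPrewarm(config)) return { skipped: true };
  const result = await loadModulesAndDiscover();
  return { skipped: false, ...result };
}

prewarmPrimaryModel({
  isCliBackendPrimary: false,
  isCliProvider: false,
  harnessId: "other", // any non-pi harness
}).then((outcome) => console.log(outcome.skipped, expensiveCalls));
```

With the guards at the top, a non-pi configuration returns without touching the expensive path at all.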

Things That Didn't Help

  • OPENCLAW_TELEGRAM_DISABLE_AUTO_SELECT_FAMILY=1 / OPENCLAW_TELEGRAM_DNS_RESULT_ORDER=ipv4first — no effect
  • Disabling the telegram channel in config — no effect
  • Node 24 (upgraded from v22) — no effect

Workaround

Add to the systemd unit:

Environment=OPENCLAW_SKIP_CHANNELS=1

This drops sidecars.channels from 54 829ms to 1.9ms, and total cold start goes from ~68s to ~14s. The trade-off is that all channels, including telegram, stay disabled.

Suggested Fixes

  1. Re-order gates in prewarmConfiguredPrimaryModel (server.impl-*:8428): check isConfiguredCliBackendPrimary / isCliProvider / selectAgentHarness().id !== "pi" before the Promise.all import block and before calling ensureOpenClawModelsJson. Non-pi providers (google, openai, custom) should return immediately with zero discovery work.

  2. Reuse the plugins.bootstrap snapshot in getCurrentPluginMetadataSnapshot: the full discovery already ran once (2.7s at plugins.bootstrap). The result should be cached in a process-singleton that ensureOpenClawModelsJson reads rather than re-discovering. The MODELS_JSON_STATE.readyCache fingerprint cache is keyed per targetPath, but the underlying plugin metadata scan runs unconditionally on a cache miss.

  3. Break the sync discovery loop: discoverInDirectory + loadPluginManifest pin the event loop for 22s in a tight synchronous loop. Inserting await new Promise(r => setImmediate(r)) between manifest reads, or moving discovery to a worker thread, would allow the rest of startup to interleave and would prevent starving incoming WS connections.
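The yielding idea in fix 3 can be sketched like this. It is a minimal illustration, not a patch: processWithYield, yieldToEventLoop, the batch size, and the in-memory manifest strings are assumptions, and JSON.parse stands in for the json5 parse and file reads that the real loop performs.

```javascript
// Hypothetical sketch: process items in small synchronous batches and yield
// to the event loop between batches so other callbacks can run.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

async function processWithYield(items, handleItem, batchSize = 25) {
  const results = [];
  for (let i = 0; i < items.length; i += 1) {
    results.push(handleItem(items[i]));
    // After each full batch, let pending I/O and timers make progress.
    if ((i + 1) % batchSize === 0) await yieldToEventLoop();
  }
  return results;
}

// Stand-in for parsing one manifest; real discovery would read and parse a file.
const parseManifest = (text) => JSON.parse(text);

const manifests = Array.from({ length: 100 }, (_, i) => `{"id":"plugin-${i}"}`);
let otherWorkRan = false;
setImmediate(() => { otherWorkRan = true; }); // e.g. an incoming connection

processWithYield(manifests, parseManifest).then((parsed) => {
  console.log(parsed.length, otherWorkRan); // prints: 100 true
});
```

Yielding does not reduce the total work, so it complements rather than replaces fixes 1 and 2; it only bounds how long any single stretch of discovery can hold the loop.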
