Gateway blocks event loop 30s+ per message; bundled runtime deps re-stage every startup, manifest never persists #75437

Summary

On a clean Linux install (Ubuntu 24.04, Linux 6.8, Node 22.22.2), openclaw gateway run blocks the Node event loop for 30+ seconds per inbound message. Reproduced across 2026.4.15, 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, 2026.4.29. The bundled runtime deps re-stage on every gateway startup and on every message, and the manifest is never persisted in a form that prevents the next staging.

The symptom for end users on Telegram: message → "typing…" → stops → starts again → reply, with 30 s – 3 min latency. Confirmed gateway-wide (Telegram + MS Teams aspire-help bot both slow).

Cleanest repro (no host state, no user plugins enabled)

A brand-new --dev profile on a clean droplet:

$ openclaw --version
OpenClaw 2026.4.29 (a448042)

$ ls /root/.openclaw-dev    # confirms no prior state
ls: cannot access '/root/.openclaw-dev': No such file or directory

$ openclaw --dev gateway run
2026-05-01T04:38:58 [gateway] loading configuration…
2026-05-01T04:38:58 [gateway] starting...
2026-05-01T04:39:05 [gateway] [plugins] staging bundled runtime deps before gateway startup (35 specs): ...
2026-05-01T04:40:07 [gateway] [plugins] installed bundled runtime deps before gateway startup in 61827ms
2026-05-01T04:40:07 [plugins] acpx staging bundled runtime deps (42 specs): ...
2026-05-01T04:40:18 [plugins] acpx installed bundled runtime deps in 10657ms
2026-05-01T04:40:35 [diagnostic] liveness warning: reasons=event_loop_delay
                    eventLoopDelayP99Ms=1370.5 eventLoopDelayMaxMs=26659
                    eventLoopUtilization=0.919
2026-05-01T04:40:35 [gateway] http server listening (8 plugins; 96.5s)
2026-05-01T04:40:36 [gateway] ready

96.5 seconds to "ready" with 8 default plugins, no user config, no Telegram/MSTeams. A 26-second event-loop block during the staging phase.

Production trace (one Telegram message)

Embedded fallback trace from a single inbound agent run with 5 agents configured (Anthropic + Google + Telegram + MSTeams + memory-core):

[agent/embedded] [trace:embedded-run] startup stages: phase=attempt-dispatch
  totalMs=13904 stages=
    workspace:2ms,
    runtime-plugins:3498ms,
    hooks:2ms,
    model-resolution:2621ms,
    auth:3886ms,
    context-engine:3ms,
    attempt-dispatch:3891ms

[agent/embedded] [trace:embedded-run] prep stages: phase=stream-ready
  totalMs=39033 stages=
    workspace-sandbox:33ms,
    skills:1ms,
    core-plugin-tools:16352ms,    ← per-message tool re-load
    bootstrap-context:85ms,
    bundle-tools:1374ms,
    system-prompt:7277ms,
    session-resource-loader:7536ms,
    agent-session:7ms,
    stream-setup:6368ms

Concurrent: liveness warning eventLoopDelayMaxMs=39795.6, utilization=1.0

core-plugin-tools costs ≈ 16 s on every message, regardless of whether the system prompt is 36 KB or 0.4 KB (verified by stripping all per-agent context files to 44-byte stubs and re-running; total prep changed by under 5 %).

Why this is upstream, not host state

We ran a controlled diagnostic to rule out host-state corruption:

Variable                        Production ~/.openclaw/                                          Fresh ~/.openclaw-dev/
Existing state                  5.7 GB, 6 stacked versions                                       None — directory did not exist
Custom config                   5 agents, 4 channels, 3 MCP servers                              None (wizard defaults)
Plugins enabled                 38 specs across anthropic/google/telegram/msteams/memory-core    35 specs default + 8 plugins
Time to "ready"                 n/a (always running)                                             96.5 s, single 26 s loop block
Per-message core-plugin-tools   16 s                                                             not measurable without agent (no auth)

A clean profile reproduces the same multi-tens-of-seconds runtime-deps install at startup. The bug is not host-state-dependent.

Root-cause hypothesis (CPU profile evidence)

Live V8 CPU profile, 181 s window during a confirmed stall (file: cpu-profile-live-2026-04-30T04-09-35-052Z.cpuprofile, attachable):

By total wall time:

  • loadOpenClawPlugins — 96.9 s (53.6 %)
  • withBundledRuntimeDepsFilesystemLock — 53.2 s (29.4 %)
  • ensureBundledPluginRuntimeDeps — 36.7 s (20.3 %)
  • withBundledRuntimeDepsInstallRootLock — 35.2 s (19.4 %)

By self time:

  • child_process.spawn — 30.2 s (16.7 % of busy CPU) — actual npm install subprocess invocations.
  • normalizePluginLoaderAliasMapForJiti (dist/sdk-alias-DIhpBBl1.js:320) — 24 % of all samples, called from getCachedPluginJitiLoader (dist/bundled-plugin-metadata-VxOxTVqO.js:99). The "cached" function runs per message and re-normalizes the alias map each time: the map sits in V8 NameDictionary representation, so every call pays a dictionary key sort plus per-entry path.resolve(). Hot V8 symbols around it confirm this (EnumIndexComparator<NameDictionary>, GetOwnEnumPropertyDictionaryKeys, Builtins_ForInFilter, String::WriteToFlat). A minimal sketch of this pattern follows the list.
  • ConcurrentMarking::RunMajor (V8 GC) — 21 % of samples, a secondary symptom of the allocation churn above.
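
For reference, a minimal TypeScript sketch of the pattern the profile points at. The function and file names are taken from the profiled dist symbols; the body is an assumption reconstructed from the hot frames, not the actual OpenClaw source:

// Hypothetical reconstruction of the hot path, not the real OpenClaw code.
// Names come from the profiled dist symbols; the body is inferred from the
// V8 frames (dictionary key enumeration, per-entry path.resolve, allocation churn).
import path from "node:path";

type AliasMap = Record<string, string>;

// Reached from getCachedPluginJitiLoader on every inbound message.
function normalizePluginLoaderAliasMapForJiti(aliases: AliasMap, rootDir: string): AliasMap {
  const normalized: AliasMap = {};
  // The alias-map object is in V8 NameDictionary mode, so this enumeration plus
  // sort is what shows up as EnumIndexComparator / GetOwnEnumPropertyDictionaryKeys.
  for (const key of Object.keys(aliases).sort()) {
    normalized[key] = path.resolve(rootDir, aliases[key]); // per-entry resolve, on every call
  }
  return normalized; // a fresh object per message, feeding the GC churn noted above
}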

The cycle that doesn't break

  1. Plugin requests a runtime dep not in /root/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/.openclaw-runtime-deps.json (the static manifest).
  2. Gateway logs staging bundled runtime deps... N missing.
  3. npm install --ignore-scripts <missingSpecs> runs in installExecutionRoot.
  4. Install reports success in NN ms — but specs never make it back to the on-disk manifest.
  5. Additionally, pruneRetainedRuntimeDepsManifestSpecs deletes anything in node_modules/ not in the manifest.
  6. Next startup / next message → re-detect "missing" → respawn npm → goto 1.

Note: in 2026.4.29, /root/.openclaw/plugin-runtime-deps/openclaw-2026.4.29-<hash>/.openclaw-runtime-deps.json does not exist at all, while older versions did ship it (e.g. 2026.4.27 had 16 baked-in specs). Either the manifest format changed and the writer stopped emitting the file, or it is simply never written in this release. Plugins still request 38 specs, so 100 % of them are treated as missing on every cycle. A sketch of the loop follows.
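
To make the cycle concrete, here is a minimal TypeScript sketch of the loop as we understand it from the logs and the profile. ensureBundledPluginRuntimeDeps and pruneRetainedRuntimeDepsManifestSpecs are real symbol names from the profile; the file layout and everything inside the function body are assumptions:

// Hypothetical sketch of the staging loop as observed, not the actual implementation.
import { execFileSync } from "node:child_process";
import { existsSync, readFileSync } from "node:fs";
import path from "node:path";

function ensureBundledPluginRuntimeDeps(requestedSpecs: string[], installExecutionRoot: string): void {
  const manifestPath = path.join(installExecutionRoot, ".openclaw-runtime-deps.json");

  // Step 1: read the static manifest. In 2026.4.29 the file does not exist,
  // so every requested spec is treated as missing on every call.
  const manifest: string[] = existsSync(manifestPath)
    ? JSON.parse(readFileSync(manifestPath, "utf8"))
    : [];
  const missing = requestedSpecs.filter((spec) => !manifest.includes(spec));
  if (missing.length === 0) return; // the no-op path that is never reached today

  // Steps 2-3: npm runs while the runtime-deps filesystem lock is held; this install
  // accounts for the multi-tens-of-seconds staging time in the startup logs.
  execFileSync("npm", ["install", "--ignore-scripts", ...missing], { cwd: installExecutionRoot });

  // Step 4: nothing writes `missing` back into manifestPath, so the next startup or
  // message re-detects the same specs as missing.
  // Step 5: pruning against the (empty) manifest then deletes what was just installed,
  // guaranteeing the cycle repeats.
}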

Things ruled out

  • Memory cgroup / swap: raising MemoryHigh 1G→3G eliminated 337K throttle events and 972 MB swap usage but did not change the symptom.
  • Trajectory bloat: archived a 9.3 MB Telegram trajectory; no improvement.
  • Auth flow / network: stalls reproduce without an outbound model call (during gateway boot itself).
  • Droplet sizing: DigitalOcean Basic Regular 4GB / 2vCPU NYC3, with plenty of RAM headroom.
  • Host filesystem state: clean --dev profile reproduces (above).
  • Per-message context size: stripping SOUL/MEMORY/TOOLS files from 36 KB to 0.4 KB stubs changed total prep by <5 %.
  • Downgrading: 2026.4.26 has the same buggy getCachedPluginJitiLoader cache-key shape (${jitiFilename}::${params.cacheScopeKey ?? cacheKey}); only the function name moved between versions. 2026.4.15 had a simpler-keyed cache that may have hit more often, but the cycle is fundamentally the same.

Environment

OpenClaw      2026.4.29 (a448042)
Node          v22.22.2
Platform      Linux 6.8.0-110-generic (Ubuntu 24.04)
Host          DigitalOcean Basic Regular Intel, 4GB / 2vCPU, NYC3
Filesystem    ext4 on /dev/vda1 (no overlayfs, not WSL/Docker Desktop)
Configured    5 agents (anthropic+google primaries), telegram+msteams channels,
              3 stdio MCP servers, ~30 plugins.allow keys

Evidence (attachable on request)

In /root/projects/openclaw-perf-evidence-2026-04-30/:

  • cpu-profile-live-2026-04-30T04-09-35-052Z.cpuprofile (42 MB) — live V8 profile during a confirmed stall.
  • perf-stall2.data (2.4 MB) + perf-153801.map (V8 JIT symbols) + perf-report.txt + perf-script.txt — Linux perf record capture, full call stacks.
  • analyze-profile.py + capture-live-profile.mjs — capture/analysis scripts used.

Repro recipe for a maintainer

# Reproduces the 96 s startup + 26 s loop block on any Linux Node 22 host:
openclaw --dev gateway run
# Watch for: [diagnostic] liveness warning ... eventLoopDelayMaxMs=N (N > 5000)
# Watch for: installed bundled runtime deps before gateway startup in NNNNNms (N > 5000)

Possible directions

  1. Persist the install into the on-disk manifest so the next staging pass is a no-op. The 12 ms install times we see indicate npm install itself is fine when nothing is missing; the cycle is about the manifest write being lost (see the sketch after this list).
  2. True caching in getCachedPluginJitiLoader: the function name implies a cache, but ${params.cacheScopeKey ?? cacheKey} with different callers passing different cacheScopeKeys effectively bypasses it (also sketched below).
  3. Hoist ensureBundledPluginRuntimeDeps out of the per-message path entirely. It belongs at startup or after a config change, not on every Telegram message.
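
A rough TypeScript sketch of directions 1 and 2, for discussion only. The function names mirror the profiled symbols; the signatures and behavior are assumptions, not the actual OpenClaw API:

// Hypothetical sketches, not the real OpenClaw code.
import { writeFileSync } from "node:fs";

// Direction 1: after a successful install, persist the union of previous and newly
// installed specs so the next staging pass finds nothing missing and becomes a no-op.
function persistRuntimeDepsManifest(manifestPath: string, previous: string[], installed: string[]): void {
  const specs = Array.from(new Set([...previous, ...installed])).sort();
  writeFileSync(manifestPath, JSON.stringify(specs, null, 2));
}

// Direction 2: key the jiti loader cache only on inputs that actually change the loader,
// so different callers stop bypassing the cache and re-normalizing the alias map per message.
const loaderCache = new Map<string, unknown>();
function getCachedPluginJitiLoader(jitiFilename: string, buildLoader: () => unknown): unknown {
  let loader = loaderCache.get(jitiFilename);
  if (loader === undefined) {
    loader = buildLoader(); // normalize the alias map once, here
    loaderCache.set(jitiFilename, loader);
  }
  return loader;
}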

Happy to capture more profiles, run patched builds, or test a candidate fix on this droplet.
