Gateway blocks event loop 30s+ per message; bundled runtime deps re-stage every startup, manifest never persists #75437

Summary

On a clean Linux install (Ubuntu 24.04, Linux 6.8, Node 22.22.2), openclaw gateway run blocks the Node event loop for 30+ seconds per inbound message. Reproduced across 2026.4.15, 2026.4.24, 2026.4.25, 2026.4.26, 2026.4.27, 2026.4.29. The bundled runtime deps re-stage on every gateway startup and on every message, and the manifest is never persisted in a form that prevents the next staging.

The symptom for end users on Telegram: message → "typing…" → stops → starts again → reply, with 30 s – 3 min latency. Confirmed gateway-wide (Telegram + MS Teams aspire-help bot both slow).

Cleanest repro (no host state, no user plugins enabled)

A brand-new --dev profile on a clean droplet:

$ openclaw --version
OpenClaw 2026.4.29 (a448042)

$ ls /root/.openclaw-dev    # confirms no prior state
ls: cannot access '/root/.openclaw-dev': No such file or directory

$ openclaw --dev gateway run
2026-05-01T04:38:58 [gateway] loading configuration…
2026-05-01T04:38:58 [gateway] starting...
2026-05-01T04:39:05 [gateway] [plugins] staging bundled runtime deps before gateway startup (35 specs): ...
2026-05-01T04:40:07 [gateway] [plugins] installed bundled runtime deps before gateway startup in 61827ms
2026-05-01T04:40:07 [plugins] acpx staging bundled runtime deps (42 specs): ...
2026-05-01T04:40:18 [plugins] acpx installed bundled runtime deps in 10657ms
2026-05-01T04:40:35 [diagnostic] liveness warning: reasons=event_loop_delay
                    eventLoopDelayP99Ms=1370.5 eventLoopDelayMaxMs=26659
                    eventLoopUtilization=0.919
2026-05-01T04:40:35 [gateway] http server listening (8 plugins; 96.5s)
2026-05-01T04:40:36 [gateway] ready

96.5 seconds to "ready" with 8 default plugins, no user config, no Telegram/MSTeams. A 26-second event-loop block during the staging phase.

Production trace (one Telegram message)

Embedded fallback trace from a single inbound agent run with 5 agents configured (Anthropic + Google + Telegram + MSTeams + memory-core):

[agent/embedded] [trace:embedded-run] startup stages: phase=attempt-dispatch
  totalMs=13904 stages=
    workspace:2ms,
    runtime-plugins:3498ms,
    hooks:2ms,
    model-resolution:2621ms,
    auth:3886ms,
    context-engine:3ms,
    attempt-dispatch:3891ms

[agent/embedded] [trace:embedded-run] prep stages: phase=stream-ready
  totalMs=39033 stages=
    workspace-sandbox:33ms,
    skills:1ms,
    core-plugin-tools:16352ms,    ← per-message tool re-load
    bootstrap-context:85ms,
    bundle-tools:1374ms,
    system-prompt:7277ms,
    session-resource-loader:7536ms,
    agent-session:7ms,
    stream-setup:6368ms

Concurrent: liveness warning eventLoopDelayMaxMs=39795.6, utilization=1.0

core-plugin-tools costs ≈ 16 s on every message, regardless of whether the system prompt is 36 KB or 0.4 KB (verified by stripping all per-agent context files to 44-byte stubs and re-running; total prep changed by under 5 %).

Why this is upstream, not host state

We ran a controlled diagnostic to rule out host-state corruption:

Variable                        Production ~/.openclaw/                                          Fresh ~/.openclaw-dev/
Existing state                  5.7 GB, 6 stacked versions                                       None — directory did not exist
Custom config                   5 agents, 4 channels, 3 MCP servers                              None (wizard defaults)
Plugins enabled                 38 specs across anthropic/google/telegram/msteams/memory-core    35 specs default + 8 plugins
Time to "ready"                 n/a (always running)                                             96.5 s, single 26 s loop block
Per-message core-plugin-tools   16 s                                                             not measurable without agent (no auth)

A clean profile reproduces the same multi-tens-of-seconds runtime-deps install at startup. The bug is not host-state-dependent.

Root-cause hypothesis (CPU profile evidence)

Live V8 CPU profile, 181 s window during a confirmed stall (file: cpu-profile-live-2026-04-30T04-09-35-052Z.cpuprofile, attachable):

By total wall time:

  • loadOpenClawPlugins — 96.9 s (53.6 %)
  • withBundledRuntimeDepsFilesystemLock — 53.2 s (29.4 %)
  • ensureBundledPluginRuntimeDeps — 36.7 s (20.3 %)
  • withBundledRuntimeDepsInstallRootLock — 35.2 s (19.4 %)

By self time:

  • child_process.spawn — 30.2 s (16.7 % of busy CPU) — actual npm install subprocess invocations.
  • normalizePluginLoaderAliasMapForJiti (dist/sdk-alias-DIhpBBl1.js:320) — 24 % of all samples, called from getCachedPluginJitiLoader (dist/bundled-plugin-metadata-VxOxTVqO.js:99). The "cached" function runs per message and re-normalizes the alias map each time: the map sits in V8 NameDictionary representation, so every call pays a dictionary key sort plus per-entry path.resolve(). Hot V8 symbols around it confirm this (EnumIndexComparator<NameDictionary>, GetOwnEnumPropertyDictionaryKeys, Builtins_ForInFilter, String::WriteToFlat). A minimal sketch of this pattern follows the list.
  • ConcurrentMarking::RunMajor (V8 GC) — 21 % of samples, a secondary symptom of the allocation churn above.
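
For reference, a minimal TypeScript sketch of the pattern the profile points at. The function and file names are taken from the profiled dist symbols; the body is an assumption reconstructed from the hot frames, not the actual OpenClaw source:

// Hypothetical reconstruction of the hot path, not the real OpenClaw code.
// Names come from the profiled dist symbols; the body is inferred from the
// V8 frames (dictionary key enumeration, per-entry path.resolve, allocation churn).
import path from "node:path";

type AliasMap = Record<string, string>;

// Reached from getCachedPluginJitiLoader on every inbound message.
function normalizePluginLoaderAliasMapForJiti(aliases: AliasMap, rootDir: string): AliasMap {
  const normalized: AliasMap = {};
  // The alias-map object is in V8 NameDictionary mode, so this enumeration plus
  // sort is what shows up as EnumIndexComparator / GetOwnEnumPropertyDictionaryKeys.
  for (const key of Object.keys(aliases).sort()) {
    normalized[key] = path.resolve(rootDir, aliases[key]); // per-entry resolve, on every call
  }
  return normalized; // a fresh object per message, feeding the GC churn noted above
}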

The cycle that doesn't break

  1. Plugin requests a runtime dep not in /root/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/.openclaw-runtime-deps.json (the static manifest).
  2. Gateway logs staging bundled runtime deps... N missing.
  3. npm install --ignore-scripts <missingSpecs> runs in installExecutionRoot.
  4. Install reports success in NN ms — but specs never make it back to the on-disk manifest.
  5. Additionally, pruneRetainedRuntimeDepsManifestSpecs deletes anything in node_modules/ not in the manifest.
  6. Next startup / next message → re-detect "missing" → respawn npm → goto 1.

Note: in 2026.4.29, /root/.openclaw/plugin-runtime-deps/openclaw-2026.4.29-<hash>/.openclaw-runtime-deps.json does not exist at all, while older versions did ship it (e.g. 2026.4.27 had 16 baked-in specs). Either the manifest format changed and the writer stopped emitting the file, or it is simply never written in this release. Plugins still request 38 specs, so 100 % of them are treated as missing on every cycle. A sketch of the loop follows.
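
To make the cycle concrete, here is a minimal TypeScript sketch of the loop as we understand it from the logs and the profile. ensureBundledPluginRuntimeDeps and pruneRetainedRuntimeDepsManifestSpecs are real symbol names from the profile; the file layout and everything inside the function body are assumptions:

// Hypothetical sketch of the staging loop as observed, not the actual implementation.
import { execFileSync } from "node:child_process";
import { existsSync, readFileSync } from "node:fs";
import path from "node:path";

function ensureBundledPluginRuntimeDeps(requestedSpecs: string[], installExecutionRoot: string): void {
  const manifestPath = path.join(installExecutionRoot, ".openclaw-runtime-deps.json");

  // Step 1: read the static manifest. In 2026.4.29 the file does not exist,
  // so every requested spec is treated as missing on every call.
  const manifest: string[] = existsSync(manifestPath)
    ? JSON.parse(readFileSync(manifestPath, "utf8"))
    : [];
  const missing = requestedSpecs.filter((spec) => !manifest.includes(spec));
  if (missing.length === 0) return; // the no-op path that is never reached today

  // Steps 2-3: npm runs while the runtime-deps filesystem lock is held; this install
  // accounts for the multi-tens-of-seconds staging time in the startup logs.
  execFileSync("npm", ["install", "--ignore-scripts", ...missing], { cwd: installExecutionRoot });

  // Step 4: nothing writes `missing` back into manifestPath, so the next startup or
  // message re-detects the same specs as missing.
  // Step 5: pruning against the (empty) manifest then deletes what was just installed,
  // guaranteeing the cycle repeats.
}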

Things ruled out

  • Memory cgroup / swap: raising MemoryHigh 1G→3G eliminated 337K throttle events and 972 MB swap usage but did not change the symptom.
  • Trajectory bloat: archived a 9.3 MB Telegram trajectory; no improvement.
  • Auth flow / network: stalls reproduce without an outbound model call (during gateway boot itself).
  • Droplet sizing: DigitalOcean Basic Regular 4GB / 2vCPU NYC3, with plenty of RAM headroom.
  • Host filesystem state: clean --dev profile reproduces (above).
  • Per-message context size: stripping SOUL/MEMORY/TOOLS files from 36 KB to 0.4 KB stubs changed total prep by <5 %.
  • Downgrading: 2026.4.26 has the same buggy getCachedPluginJitiLoader cache-key shape (${jitiFilename}::${params.cacheScopeKey ?? cacheKey}); only the function name moved between versions. 2026.4.15 had a simpler-keyed cache that may have hit more often, but the cycle is fundamentally the same.

Environment

OpenClaw      2026.4.29 (a448042)
Node          v22.22.2
Platform      Linux 6.8.0-110-generic (Ubuntu 24.04)
Host          DigitalOcean Basic Regular Intel, 4GB / 2vCPU, NYC3
Filesystem    ext4 on /dev/vda1 (no overlayfs, not WSL/Docker Desktop)
Configured    5 agents (anthropic+google primaries), telegram+msteams channels,
              3 stdio MCP servers, ~30 plugins.allow keys

Evidence (attachable on request)

In /root/projects/openclaw-perf-evidence-2026-04-30/:

  • cpu-profile-live-2026-04-30T04-09-35-052Z.cpuprofile (42 MB) — live V8 profile during a confirmed stall.
  • perf-stall2.data (2.4 MB) + perf-153801.map (V8 JIT symbols) + perf-report.txt + perf-script.txt — Linux perf record capture, full call stacks.
  • analyze-profile.py + capture-live-profile.mjs — capture/analysis scripts used.

Repro recipe for a maintainer

# Reproduces the 96 s startup + 26 s loop block on any Linux Node 22 host:
openclaw --dev gateway run
# Watch for: [diagnostic] liveness warning ... eventLoopDelayMaxMs=N (N > 5000)
# Watch for: installed bundled runtime deps before gateway startup in NNNNNms (N > 5000)

Possible directions

  1. Persist the install into the on-disk manifest so the next staging pass is a no-op. The 12 ms install times we see indicate npm install itself is fine when nothing is missing; the cycle is about the manifest write being lost (see the sketch after this list).
  2. True caching in getCachedPluginJitiLoader: the function name implies a cache, but ${params.cacheScopeKey ?? cacheKey} with different callers passing different cacheScopeKeys effectively bypasses it (also sketched below).
  3. Hoist ensureBundledPluginRuntimeDeps out of the per-message path entirely. It belongs at startup or after a config change, not on every Telegram message.
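
A rough TypeScript sketch of directions 1 and 2, for discussion only. The function names mirror the profiled symbols; the signatures and behavior are assumptions, not the actual OpenClaw API:

// Hypothetical sketches, not the real OpenClaw code.
import { writeFileSync } from "node:fs";

// Direction 1: after a successful install, persist the union of previous and newly
// installed specs so the next staging pass finds nothing missing and becomes a no-op.
function persistRuntimeDepsManifest(manifestPath: string, previous: string[], installed: string[]): void {
  const specs = Array.from(new Set([...previous, ...installed])).sort();
  writeFileSync(manifestPath, JSON.stringify(specs, null, 2));
}

// Direction 2: key the jiti loader cache only on inputs that actually change the loader,
// so different callers stop bypassing the cache and re-normalizing the alias map per message.
const loaderCache = new Map<string, unknown>();
function getCachedPluginJitiLoader(jitiFilename: string, buildLoader: () => unknown): unknown {
  let loader = loaderCache.get(jitiFilename);
  if (loader === undefined) {
    loader = buildLoader(); // normalize the alias map once, here
    loaderCache.set(jitiFilename, loader);
  }
  return loader;
}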

Happy to capture more profiles, run patched builds, or test a candidate fix on this droplet.
