## Summary
After upgrading from 2026.4.15 → 2026.4.27, the gateway exhibits ~7-second synchronous event-loop stalls during normal agent activity, blocking Telegram polling and forcing watchdog-triggered container restarts. In the 24h post-upgrade window: 27 gateway crashes, 25 of them Tier 2 (full container restart). The pre-upgrade deployment had no such pattern.
`.cpuprofile` evidence (6 stalls aggregated, 230s combined CPU) shows the dominant hot path is `createOpenClawTools` / `createOpenClawCodingTools` invoking `loadPluginRegistrySnapshotWithMetadata` and `resolveProviderAuthEnvVarCandidates` synchronously on every tool-creation pass. Top self-time frames are `lstat`, `readFileSync`, `path.resolve` — a classic uncached fs scan.
This may share a root cause with #73532 (plugin loader hot loop on 4.25/4.26) but differs in entry point and version: that report's hot stack is `prepareBundledPluginRuntimeDistMirror` (mirror-init); mine is `createOpenClawTools` (tool-creation per agent invocation). Filing separately so the maintainers can dedup or split as appropriate.
This is distinct from #74345 (ACP session leak / 480s `task-registry-maintenance` event-loop holds, also on 4.27) — different fingerprint (~7s stalls, no ACP cleanup loop, no session-write-lock warnings).
## Environment
- OpenClaw `2026.4.27` (`c27ae31043c7`) — installed via global npm
- Node `v22.22.1`
- Linux x86_64, host-mode container (Docker), gateway port 54401
- Channels enabled: Telegram (multi-bot), WhatsApp
- ~10 user agents bound to lanes via `bindings[]`; many are tool-call-heavy
## Symptom fingerprint
`/data/.openclaw/logs/gateway-eventloop.jsonl` — continuous ~7s stalls during normal load:
| Timestamp (UTC) | Stall (ms) |
| --- | --- |
| 04:17:28Z | 6937 |
| 04:17:43Z | 7130 |
| 04:18:49Z | 7151 |
| 04:19:19Z | 7365 |
| 04:20:34Z | 6942 |
| 04:20:49Z | 7038 |
| 04:23:04Z | 7650 |
ELU peaked at 0.81 across two consecutive samples, with `maxMs` = 6941–7650ms. CPU profiles auto-saved per stall.
`gateway-monitor.log` — 27 gateway crashes in 24h, 25 of them Tier 2 (full container restart), concentrated in the post-upgrade window. The watchdog kills the gateway, the container restarts, load resumes, and the stalls return.
While stalled: the Telegram long poll stays alive (`update-offset-*.json` mtime advances), and `getChat`/`getChatMember` confirm bot-side state is fine, but channel-post-routed agent invocations don't dispatch — the agent isn't getting CPU.
## CPU profile analysis (6 stalls aggregated, 230s combined)
Top 5 by inclusive time (caller + descendants):
| Inclusive ms |
% |
Function |
Source |
| 51,637 |
3.2% |
runEmbeddedAttempt |
selection-8xKkwZC_.js:5792 |
| 46,688 |
2.9% |
createOpenClawCodingTools |
pi-tools-DvRlK7Mt.js:805 |
| 46,668 |
2.9% |
createOpenClawTools |
openclaw-tools-CHlPDJlv.js:8767 |
| 32,160 |
2.0% |
loadPluginRegistrySnapshotWithMetadata |
plugin-registry-Bjn9rPwy.js:115 |
| 24,240 |
1.5% |
resolveProviderAuthEnvVarCandidates |
provider-env-vars-DOy0Czuc.js:60 |
Top 5 by self time (where the CPU actually burns):

| Self ms | % | Function |
| --- | --- | --- |
| 13,343 | 5.8% | `lstat` |
| 4,884 | 2.1% | `post` (inspector) |
| 4,090 | 1.8% | (garbage collector) |
| 3,241 | 1.4% | `path.resolve` |
| 3,219 | 1.4% | `readFileSync` |
The hot-path pattern is unambiguous: `createOpenClawTools` + `createOpenClawCodingTools` run a synchronous plugin-registry rebuild plus provider-auth-env-var resolution on every tool-creation pass, dominated by `lstat` / `readFileSync` / `existsSync` / `readdir`.
`loadInstalledPluginIndex` (22,638ms inclusive) and `loadPluginManifest` (905ms self) confirm the plugin index is being rebuilt, not cache-served.
`resolveProviderAuthEnvVarCandidates` → `resolveManifestProviderAuthEnvVarCandidates` → `isProviderApiKeyConfigured` → `hasAuthForProvider` indicates per-tool-call provider-auth resolution that walks `process.env` against every model provider's possible env-var names.
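For the env-walk half, a per-process memo is the obvious shape. A sketch (the function name mirrors the profiled frame, but the body and alias table are illustrative assumptions, not OpenClaw source):

```javascript
// Hypothetical memoized version of the profiled env-var resolution.
// Without the cache, every tool-creation pass rescans process.env,
// scaling with (tools x providers x aliases per provider).
const PROVIDER_ENV_ALIASES = {
  // illustrative alias table, not OpenClaw's real one
  openai: ['OPENAI_API_KEY'],
  anthropic: ['ANTHROPIC_API_KEY', 'CLAUDE_API_KEY'],
};

const candidateCache = new Map();

function resolveProviderAuthEnvVarCandidates(provider) {
  if (candidateCache.has(provider)) return candidateCache.get(provider);
  const aliases = PROVIDER_ENV_ALIASES[provider] ?? [];
  // The expensive part in the profile: walking process.env per call.
  const configured = aliases.filter((name) => process.env[name] !== undefined);
  candidateCache.set(provider, configured);
  return configured; // env rarely changes after startup; cache for process lifetime
}
```

A process-lifetime cache is safe here only if env vars are not mutated at runtime; otherwise an explicit invalidation hook on config reload would be needed.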
## Inferred bug class
In `2026.4.27`, the tool-creation path appears to:
- Reload the entire plugin registry from disk synchronously on every invocation, rather than serving from the cache that `loadOpenClawPlugins` is supposed to provide.
- Re-resolve provider-auth env-var candidates per tool, synchronously and without caching — this scales with (number of tools × number of providers × number of env-var aliases per provider).
Each pass takes ~7s of CPU time, blocking the event loop. Under any non-trivial agent activity (multiple tool calls per minute), the gateway never recovers and the watchdog kills it.
This blocks: Telegram long-poll dispatch, agent response generation, all PM2 worker IPC.
Likely overlaps with #73532's "cache key varies between callers" / "cache invalidated too aggressively" hypotheses, but expressed via the tool-creation entry point rather than the mirror-prep one.
## Reproduction
- Install `openclaw@2026.4.27` (or pull `c27ae31043c7`)
- Configure ≥2 enabled user plugins and ≥2 model providers with auth profiles
- Bind ≥1 Telegram bot to an agent and send the agent tool-call-heavy prompts under steady load (~1–2 invocations/minute)
- Within minutes, `gateway-eventloop.jsonl` will show 6000–7700ms stalls per sample window
- Within ~1h, a watchdog Tier 2 restart will fire
CPU profiles auto-write to `<DATA>/.openclaw/logs/gateway-stall-profiles/<pid>-<ts>/*.cpuprofile`. Loading any of them in Chrome DevTools → Performance shows a flame graph matching the tables above.
## Suggested next steps
- Instrument `getCachedPluginRegistry` / `setCachedPluginRegistry` (per #73532's suggestion) to log cache key + hit/miss; verify whether `createOpenClawTools` is cache-missing on every call.
- Audit `resolveProviderAuthEnvVarCandidates` for memoization — if there's no per-process or per-tool cache, the per-invocation env walk is the easy half of this regression to fix.
## Artifacts available
- 6 `.cpuprofile` files (10MB total) — happy to attach a redacted bundle on request
- `gateway-eventloop.jsonl` excerpt
- `gateway-monitor.log` crash log
- Internal finding doc with full top-10 inclusive/self frame analysis
## Related
- #73532 — Plugin loader hot loop: `prepareBundledPluginRuntimeDistMirror` + JSON5 manifest parsing saturate gateway and starve event loop. Adjacent code path, possibly same root cause.