
Gateway 7-second event-loop stalls on 2026.4.27 — createOpenClawTools synchronously rebuilds plugin registry + provider-auth env-var resolution per tool-creation pass #74860

@mids-neo

Description


Summary

After upgrading from 2026.4.15 → 2026.4.27, the gateway exhibits ~7-second synchronous event-loop stalls during normal agent activity, blocking Telegram polling and forcing watchdog-triggered container restarts. In the 24h post-upgrade window: 27 gateway crashes, 25 of them Tier 2 (full container restart). Pre-upgrade, the gateway had been operating without this pattern.

.cpuprofile evidence (6 stalls aggregated, 230s combined CPU) shows the dominant hot path is createOpenClawTools / createOpenClawCodingTools invoking loadPluginRegistrySnapshotWithMetadata and resolveProviderAuthEnvVarCandidates synchronously on every tool-creation pass. Top self-time frames are lstat, readFileSync, path.resolve — classic uncached fs scan.

This may share root cause with #73532 (plugin loader hot loop on 4.25/4.26) but differs in entry point and version: that report's hot stack is prepareBundledPluginRuntimeDistMirror (mirror-init); mine is createOpenClawTools (tool-creation per agent invocation). Filing separately so the maintainer can dedup or split as appropriate.

This is distinct from #74345 (ACP session leak / 480s task-registry-maintenance event-loop holds, also on 4.27) — different fingerprint (~7s stalls, no ACP cleanup loop, no session-write-lock warnings).

Environment

  • OpenClaw 2026.4.27 (c27ae31043c7) — installed via global npm
  • Node v22.22.1
  • Linux x86_64, host-mode container (Docker), gateway port 54401
  • Channels enabled: Telegram (multi-bot), WhatsApp
  • ~10 user agents bound to lanes via bindings[]; many tool-call-heavy

Symptom fingerprint

/data/.openclaw/logs/gateway-eventloop.jsonl — continuous ~7s stalls during normal load:

Timestamp (UTC)   Stall (ms)
04:17:28Z         6937
04:17:43Z         7130
04:18:49Z         7151
04:19:19Z         7365
04:20:34Z         6942
04:20:49Z         7038
04:23:04Z         7650

ELU peaked at 0.81 across two consecutive samples, with maxMs=6941–7650ms. CPU profiles auto-saved per stall.

gateway-monitor.log — 27 gateway crashes / 24h, 25 Tier 2 (full container restart), concentrated in the post-upgrade window. Watchdog kills gateway, container restarts, load resumes, stalls return.

While stalled: Telegram long-poll stays alive (update-offset-*.json mtime advances), getChat/getChatMember confirm bot-side state is fine, but channel-post-routed agent invocations don't dispatch — the agent isn't getting CPU.

CPU profile analysis (6 stalls aggregated, 230s combined)

Top 5 by inclusive time (caller + descendants):

Inclusive ms   %      Function                                 Source
51,637         3.2%   runEmbeddedAttempt                       selection-8xKkwZC_.js:5792
46,688         2.9%   createOpenClawCodingTools                pi-tools-DvRlK7Mt.js:805
46,668         2.9%   createOpenClawTools                      openclaw-tools-CHlPDJlv.js:8767
32,160         2.0%   loadPluginRegistrySnapshotWithMetadata   plugin-registry-Bjn9rPwy.js:115
24,240         1.5%   resolveProviderAuthEnvVarCandidates      provider-env-vars-DOy0Czuc.js:60

Top 5 by self time (where CPU actually burns):

Self ms   %      Function
13,343    5.8%   lstat
4,884     2.1%   post (inspector)
4,090     1.8%   (garbage collector)
3,241     1.4%   path.resolve
3,219     1.4%   readFileSync

The hot-path pattern is unambiguous: createOpenClawTools + createOpenClawCodingTools running synchronous plugin-registry rebuild + provider-auth-env-var resolution on every tool-creation pass, dominated by lstat / readFileSync / existsSync / readdir.

loadInstalledPluginIndex (22,638ms inclusive) and loadPluginManifest (905ms self) confirm the plugin index is being rebuilt, not cache-served.

The call chain resolveProviderAuthEnvVarCandidates → resolveManifestProviderAuthEnvVarCandidates → isProviderApiKeyConfigured → hasAuthForProvider indicates per-tool-call provider-auth resolution that walks process.env against every model provider's possible env-var names.
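To make the scaling concrete, here is a hypothetical sketch of what that chain likely looks like (function names are from the profile; the bodies, the alias table, and the tool list are my invention to illustrate the O(tools × providers × aliases) cost, not OpenClaw's source):

```javascript
// Hypothetical alias table: one entry per supported provider.
const PROVIDER_ENV_ALIASES = {
  openai: ['OPENAI_API_KEY', 'OPENAI_KEY'],
  anthropic: ['ANTHROPIC_API_KEY', 'CLAUDE_API_KEY'],
  // ...more providers in the real gateway
};

function hasAuthForProvider(provider) {
  // Walks every alias against process.env: cheap once, expensive when
  // repeated for every tool on every tool-creation pass.
  return (PROVIDER_ENV_ALIASES[provider] ?? [])
    .some((name) => typeof process.env[name] === 'string' && process.env[name] !== '');
}

function resolveProviderAuthEnvVarCandidates(tools) {
  const providers = Object.keys(PROVIDER_ENV_ALIASES);
  // The profile suggests this whole resolution runs per tool, per pass,
  // which is why the cost scales with tools × providers × aliases.
  return tools.map((tool) => ({
    tool,
    available: providers.filter(hasAuthForProvider),
  }));
}
```

With ~10 agents, dozens of tools, and many providers, even a few microseconds per env lookup multiplies quickly when nothing is memoized.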

Inferred bug class

In 2026.4.27, the tool-creation path appears to:

  1. Reload the entire plugin registry from disk synchronously on every invocation rather than serving from the cache that loadOpenClawPlugins is supposed to provide.
  2. Re-resolve provider-auth env-var candidates per tool synchronously without caching — this scales with (number of tools × number of providers × number of env-var aliases per provider).

Each pass takes ~7s of CPU time, blocking the event loop. Under any non-trivial agent activity (multiple tool calls per minute), the gateway never recovers and watchdog kills it.

This blocks: Telegram long-poll dispatch, agent response generation, all PM2 worker IPC.

Likely overlaps with #73532's "cache key varies between callers" / "cache invalidated too aggressively" hypotheses, but expressed via the tool-creation entry point rather than the mirror-prep one.
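For clarity on that hypothesis, a hypothetical illustration of the "cache key varies between callers" bug class (my code, not OpenClaw's): when the key bakes in caller-specific state, the two entry points can never share an entry.

```javascript
const snapshotCache = new Map();

function loadSnapshotBuggy(dir, caller, expensiveLoad) {
  // Bug class: `caller` should not be part of the key. Mirror-prep and
  // tool-creation each rebuild their own copy of the same snapshot,
  // so the cache exists but never actually saves work across entry points.
  const key = `${caller}:${dir}`;
  if (!snapshotCache.has(key)) snapshotCache.set(key, expensiveLoad(dir));
  return snapshotCache.get(key);
}
```

Keying on `dir` alone (plus an mtime guard) would let both entry points share one snapshot.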

Reproduction

  1. Install openclaw@2026.4.27 (or pull c27ae31043c7)
  2. Configure ≥2 enabled user plugins and ≥2 model providers with auth profiles
  3. Bind ≥1 Telegram bot to an agent and send the agent tool-call-heavy prompts under steady load (~1–2 invocations/minute)
  4. Within minutes, gateway-eventloop.jsonl will show 6000–7700ms stalls per sample window
  5. Within ~1h, watchdog Tier 2 restart will fire

CPU profiles auto-write to <DATA>/.openclaw/logs/gateway-stall-profiles/<pid>-<ts>/*.cpuprofile. Loading any in Chrome DevTools → Performance shows the flame graph matching the table above.

Suggested next steps

  • Serve loadPluginRegistrySnapshotWithMetadata from the cache loadOpenClawPlugins already maintains (or an mtime-guarded memo) instead of rescanning disk on every tool-creation pass.
  • Cache resolveProviderAuthEnvVarCandidates results for the life of the process; process.env effectively doesn't change at runtime, so the per-tool × per-provider × per-alias walk only needs to run once.
  • Compare cache-key derivation between this entry point and #73532's prepareBundledPluginRuntimeDistMirror path to confirm or rule out a shared root cause.

Artifacts available

  • 6 .cpuprofile files (10MB total) — happy to attach a redacted bundle on request
  • gateway-eventloop.jsonl excerpt
  • gateway-monitor.log crash log
  • Internal finding doc with full top-10 inclusive/self frame analysis

Related

  • #73532: plugin loader hot loop on 4.25/4.26 (possible shared root cause, different entry point)
  • #74345: ACP session leak / 480s task-registry-maintenance event-loop holds on 4.27 (distinct fingerprint)
