
Gateway 7-second event-loop stalls on 2026.4.27 — createOpenClawTools synchronously rebuilds plugin registry + provider-auth env-var resolution per tool-creation pass #74860

@mids-neo

Description


Summary

After upgrading from 2026.4.15 → 2026.4.27, the gateway exhibits ~7-second synchronous event-loop stalls during normal agent activity, blocking Telegram polling and forcing watchdog-triggered container restarts. In the 24h post-upgrade window: 27 gateway crashes, 25 of them Tier 2 (full container restart). Pre-upgrade, the gateway had been operating without this pattern.

.cpuprofile evidence (6 stalls aggregated, 230s combined CPU) shows the dominant hot path is createOpenClawTools / createOpenClawCodingTools invoking loadPluginRegistrySnapshotWithMetadata and resolveProviderAuthEnvVarCandidates synchronously on every tool-creation pass. Top self-time frames are lstat, readFileSync, path.resolve — classic uncached fs scan.

This may share root cause with #73532 (plugin loader hot loop on 4.25/4.26) but differs in entry point and version: that report's hot stack is prepareBundledPluginRuntimeDistMirror (mirror-init); mine is createOpenClawTools (tool-creation per agent invocation). Filing separately so the maintainer can dedup or split as appropriate.

This is distinct from #74345 (ACP session leak / 480s task-registry-maintenance event-loop holds, also on 4.27) — different fingerprint (~7s stalls, no ACP cleanup loop, no session-write-lock warnings).

Environment

  • OpenClaw 2026.4.27 (c27ae31043c7) — installed via global npm
  • Node v22.22.1
  • Linux x86_64, host-mode container (Docker), gateway port 54401
  • Channels enabled: Telegram (multi-bot), WhatsApp
  • ~10 user agents bound to lanes via bindings[]; many tool-call-heavy

Symptom fingerprint

/data/.openclaw/logs/gateway-eventloop.jsonl — continuous ~7s stalls during normal load:

Timestamp (UTC)   Stall (ms)
04:17:28Z         6937
04:17:43Z         7130
04:18:49Z         7151
04:19:19Z         7365
04:20:34Z         6942
04:20:49Z         7038
04:23:04Z         7650

ELU peaked at 0.81 across two consecutive samples, with maxMs=6941–7650ms. CPU profiles auto-saved per stall.

gateway-monitor.log — 27 gateway crashes / 24h, 25 Tier 2 (full container restart), concentrated in the post-upgrade window. Watchdog kills gateway, container restarts, load resumes, stalls return.

While stalled: Telegram long-poll stays alive (update-offset-*.json mtime advances), getChat/getChatMember confirm bot-side state is fine, but channel-post-routed agent invocations don't dispatch — the agent isn't getting CPU.

CPU profile analysis (6 stalls aggregated, 230s combined)

Top 5 by inclusive time (caller + descendants):

Inclusive ms   %      Function                                 Source
51,637         3.2%   runEmbeddedAttempt                       selection-8xKkwZC_.js:5792
46,688         2.9%   createOpenClawCodingTools                pi-tools-DvRlK7Mt.js:805
46,668         2.9%   createOpenClawTools                      openclaw-tools-CHlPDJlv.js:8767
32,160         2.0%   loadPluginRegistrySnapshotWithMetadata   plugin-registry-Bjn9rPwy.js:115
24,240         1.5%   resolveProviderAuthEnvVarCandidates      provider-env-vars-DOy0Czuc.js:60

Top 5 by self time (where CPU actually burns):

Self ms   %      Function
13,343    5.8%   lstat
4,884     2.1%   post (inspector)
4,090     1.8%   (garbage collector)
3,241     1.4%   path.resolve
3,219     1.4%   readFileSync

The hot-path pattern is unambiguous: createOpenClawTools + createOpenClawCodingTools running synchronous plugin-registry rebuild + provider-auth-env-var resolution on every tool-creation pass, dominated by lstat / readFileSync / existsSync / readdir.

loadInstalledPluginIndex (22,638ms inclusive) and loadPluginManifest (905ms self) confirm the plugin index is being rebuilt, not cache-served.

The call chain resolveProviderAuthEnvVarCandidates → resolveManifestProviderAuthEnvVarCandidates → isProviderApiKeyConfigured → hasAuthForProvider indicates per-tool-call provider-auth resolution that walks process.env against every model provider's possible env-var names.
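To make the scaling concrete, here is a hypothetical sketch of what that chain likely looks like (function names are from the profile; the bodies, the alias table, and the tool list are my invention to illustrate the O(tools × providers × aliases) cost, not OpenClaw's source):

```javascript
// Hypothetical alias table: one entry per supported provider.
const PROVIDER_ENV_ALIASES = {
  openai: ['OPENAI_API_KEY', 'OPENAI_KEY'],
  anthropic: ['ANTHROPIC_API_KEY', 'CLAUDE_API_KEY'],
  // ...more providers in the real gateway
};

function hasAuthForProvider(provider) {
  // Walks every alias against process.env: cheap once, expensive when
  // repeated for every tool on every tool-creation pass.
  return (PROVIDER_ENV_ALIASES[provider] ?? [])
    .some((name) => typeof process.env[name] === 'string' && process.env[name] !== '');
}

function resolveProviderAuthEnvVarCandidates(tools) {
  const providers = Object.keys(PROVIDER_ENV_ALIASES);
  // The profile suggests this whole resolution runs per tool, per pass,
  // which is why the cost scales with tools × providers × aliases.
  return tools.map((tool) => ({
    tool,
    available: providers.filter(hasAuthForProvider),
  }));
}
```

With ~10 agents, dozens of tools, and many providers, even a few microseconds per env lookup multiplies quickly when nothing is memoized.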

Inferred bug class

In 2026.4.27, the tool-creation path appears to:

  1. Reload the entire plugin registry from disk synchronously on every invocation rather than serving from the cache that loadOpenClawPlugins is supposed to provide.
  2. Re-resolve provider-auth env-var candidates per tool synchronously without caching — this scales with (number of tools × number of providers × number of env-var aliases per provider).

Each pass takes ~7s of CPU time, blocking the event loop. Under any non-trivial agent activity (multiple tool calls per minute), the gateway never recovers and watchdog kills it.

This blocks: Telegram long-poll dispatch, agent response generation, all PM2 worker IPC.

Likely overlaps with #73532's "cache key varies between callers" / "cache invalidated too aggressively" hypotheses, but expressed via the tool-creation entry point rather than the mirror-prep one.
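For clarity on that hypothesis, a hypothetical illustration of the "cache key varies between callers" bug class (my code, not OpenClaw's): when the key bakes in caller-specific state, the two entry points can never share an entry.

```javascript
const snapshotCache = new Map();

function loadSnapshotBuggy(dir, caller, expensiveLoad) {
  // Bug class: `caller` should not be part of the key. Mirror-prep and
  // tool-creation each rebuild their own copy of the same snapshot,
  // so the cache exists but never actually saves work across entry points.
  const key = `${caller}:${dir}`;
  if (!snapshotCache.has(key)) snapshotCache.set(key, expensiveLoad(dir));
  return snapshotCache.get(key);
}
```

Keying on `dir` alone (plus an mtime guard) would let both entry points share one snapshot.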

Reproduction

  1. Install openclaw@2026.4.27 (or pull c27ae31043c7)
  2. Configure ≥2 enabled user plugins and ≥2 model providers with auth profiles
  3. Bind ≥1 Telegram bot to an agent and send the agent tool-call-heavy prompts under steady load (~1–2 invocations/minute)
  4. Within minutes, gateway-eventloop.jsonl will show 6000–7700ms stalls per sample window
  5. Within ~1h, watchdog Tier 2 restart will fire

CPU profiles auto-write to <DATA>/.openclaw/logs/gateway-stall-profiles/<pid>-<ts>/*.cpuprofile. Loading any in Chrome DevTools → Performance shows the flame graph matching the table above.

Suggested next steps

  • Serve loadPluginRegistrySnapshotWithMetadata from the cache loadOpenClawPlugins already maintains (or an mtime-guarded memo) instead of rescanning disk on every tool-creation pass.
  • Cache resolveProviderAuthEnvVarCandidates results for the life of the process; process.env effectively doesn't change at runtime, so the per-tool × per-provider × per-alias walk only needs to run once.
  • Compare cache-key derivation between this entry point and #73532's prepareBundledPluginRuntimeDistMirror path to confirm or rule out a shared root cause.

Artifacts available

  • 6 .cpuprofile files (10MB total) — happy to attach a redacted bundle on request
  • gateway-eventloop.jsonl excerpt
  • gateway-monitor.log crash log
  • Internal finding doc with full top-10 inclusive/self frame analysis

Related

  • #73532: plugin loader hot loop on 4.25/4.26 (possible shared root cause, different entry point)
  • #74345: ACP session leak / 480s task-registry-maintenance event-loop holds on 4.27 (distinct fingerprint)
