[Bug]: 2026.5.22 gateway pre-warm (warmCurrentProviderAuthState) blocks event loop ~60s on startup, breaks channel handshakes #85999

### Bug type

Regression (worked before, now fails)

### Beta release blocker

No

### Summary

After upgrading from 2026.5.19 to 2026.5.22, gateway startup blocks the Node event loop for ~60 seconds inside `warmCurrentProviderAuthState`, causing channel handshakes (Discord READY, Feishu bot info, Telegram deleteWebhook) to time out, and leaving inbound messages stalled for ~1 minute on every restart.

### Steps to reproduce

1. On a 2 vCPU Linux host (Azure B2als_v2), install `openclaw@2026.5.22` and start the gateway as a systemd user service.
2. Make sure at least one agent is configured with multiple model providers (in our case: github-copilot, openai, anthropic, openrouter, plus the default catalog providers).
3. Restart the gateway (`systemctl --user restart openclaw-gateway`) and watch `/tmp/openclaw/openclaw-YYYY-MM-DD.log` and `journalctl _PID=<pid>`.
4. Once `gateway ready` is logged, send a Discord DM (or any other inbound message) to the bot within the next ~90 seconds.

### Expected behavior

On 2026.5.19 the same restart on the same host completed channel startup in ~5–10 s, and inbound messages received within the first minute were dispatched in <3 s.

### Actual behavior

Two consecutive restarts on 2026.5.22 (PIDs 712063 and 721897, ~30 minutes apart, same config) both reproduced:

- `provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms` (first restart)
- `provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms` (second restart)
- Liveness warnings during the same window: `event_loop_delay,cpu interval=37s eventLoopDelayP99Ms=14445.2 eventLoopDelayMaxMs=21776.8 eventLoopUtilization=1 cpuCoreRatio=1.041`
- Channel-side fallout, e.g.: `[fetch-timeout] fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me`, `[discord] gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)`, `[feishu] bot info probe timed out after 30000ms; continuing startup`, `[telegram] deleteWebhook failed: Network request failed`.
- End-to-end Discord inbound latency: the first user DM after restart took ~60 s before `session.started` showed up in the trajectory; the model call itself (`github-copilot/gpt-5.5`) took only ~3.4 s. The ~60 s delay is entirely on the inbound/gateway side, dominated by the pre-warm stall plus the Discord WS reconnect it triggers.

External network from this host to discord.com/api, gateway.discord.gg, and api.telegram.org is healthy (curl latency 40–680 ms with 200/302), so this is not a transit issue.

### OpenClaw version

2026.5.22 (a374c3a)

### Operating system

Ubuntu 24.04.4 LTS (Azure VM, Standard_B2als_v2, 2 vCPU, 4 GB RAM, japaneast)

### Install method

npm global (`npm i -g openclaw@2026.5.22`)

### Model

github-copilot/gpt-5.5

### Provider / routing chain

openclaw -> github-copilot

### Additional provider/model setup details

The blocked work is provider-auth pre-warm, not the model call itself, so the model/provider path is largely incidental. The config has multiple providers enabled across the agent catalog (github-copilot, openai, anthropic, openrouter, etc.), which appears to amplify the cost (see Root cause below).

### Logs, screenshots, and evidence

Two independent restarts of the same gateway both logged a single-line marker showing the pre-warm wall time and the worst single event-loop block during it:

```
2026-05-24T08:46:04+00:00 [gateway] provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms
2026-05-24T09:17:43+00:00 [gateway] provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms
```
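
For readers who want to reproduce a comparable metric outside the gateway, the following is a minimal sketch of how wall time and worst event-loop delay can be captured with Node's `perf_hooks.monitorEventLoopDelay`. The helper names `measureWithEventLoopDelay` and `formatWarmMetrics` are hypothetical and are not taken from the OpenClaw source; how the gateway itself computes `eventLoopMax` has not been verified.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');
const { setTimeout: sleep } = require('node:timers/promises');

// Hypothetical helper: runs `task` and reports its wall time plus the largest
// event-loop delay sample seen while it ran.
async function measureWithEventLoopDelay(task, resolutionMs = 10) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  // Let the sampler tick a few times so it has a baseline before the task starts.
  await sleep(resolutionMs * 3);
  const startedAt = performance.now();
  let wallMs;
  try {
    await task();
  } finally {
    wallMs = performance.now() - startedAt;
    // Give the sampler a chance to record a stall that ended just now.
    await sleep(resolutionMs * 3);
    histogram.disable();
  }
  // Histogram values are reported in nanoseconds.
  return { wallMs, eventLoopMaxMs: histogram.max / 1e6 };
}

// Hypothetical formatter that mimics the shape of the log line above.
function formatWarmMetrics({ wallMs, eventLoopMaxMs }) {
  return `pre-warmed in ${Math.round(wallMs)}ms eventLoopMax=${eventLoopMaxMs.toFixed(1)}ms`;
}
```

Passing a task that performs a synchronous busy-wait should yield an `eventLoopMaxMs` close to the length of the busy-wait, which is the pattern the two log lines show at a much larger scale.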

Liveness warning during the same window:

```
2026-05-24T08:40:17+00:00 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=37s eventLoopDelayP99Ms=14445.2 eventLoopDelayMaxMs=14445.2 eventLoopUtilization=1 cpuCoreRatio=1.041 active=2 waiting=0 queued=0 recentPhases=...,sidecars.session-locks:50912ms work=[active=agent:main:telegram:direct:...(processing/model_call,q=1,age=14s last=model_call:started)|agent:main:discord:direct:...(processing/embedded_run,q=1,age=6s last=embedded_run:started)]
```

Discord-side fallout (second restart):

```
09:16:44.618 INFO  discord client initialized; awaiting gateway readiness
09:17:21.482 ERROR discord: gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)
09:17:21.496 WARN  fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
09:17:43.612 INFO  provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms
09:18:29.214 WARN  liveness warning: ... work=[active=agent:main:discord:direct:763966741445083166(processing/embedded_run, age=28s)]
09:18:49.853 (trajectory) session.started   <- first inbound DM finally picked up
09:18:53.231 (trajectory) model.completed   <- ~3.4s model call
```
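
The `timer delayed 35211ms` figure in the fetch-timeout line is the elapsed time minus the configured timeout (45211 ms − 10000 ms = 35211 ms). The sketch below reproduces that effect at small scale: a timer is armed, the loop is then blocked synchronously, and the callback can only run once the block ends. `measureTimerDelay` is a hypothetical helper written for this illustration, and the busy-wait merely stands in for whatever the pre-warm is doing.

```javascript
// Hypothetical illustration: arm a timer, then block the event loop
// synchronously and see how late the timer callback actually fires.
function measureTimerDelay(timeoutMs, blockMs) {
  return new Promise((resolve) => {
    const armedAt = Date.now();
    setTimeout(() => {
      const elapsedMs = Date.now() - armedAt;
      resolve({ timeoutMs, elapsedMs, delayedMs: elapsedMs - timeoutMs });
    }, timeoutMs);
    // Synchronous busy-wait standing in for the blocking startup work.
    const blockUntil = armedAt + blockMs;
    while (Date.now() < blockUntil) {
      // spin
    }
  });
}

// Usage: a 50 ms timer behind a 300 ms block fires roughly 250 ms late.
measureTimerDelay(50, 300).then((result) => console.log(result));
```

The same mechanism would explain why the Discord 15 s READY timer in the log above is reported about 37 s after the client was initialized.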

### Root cause (best-effort, from reading the installed npm package on disk)

In `/usr/lib/node_modules/openclaw/dist/`:

- `server-startup-post-attach-ezNyN6B3.js` calls `warmCurrentProviderAuthState(cfg, { isCancelled })` once per gateway post-attach pass and awaits it; the wall time + worst per-tick stall are then logged via `formatProviderAuthWarmMetrics`.
- `model-provider-auth-DAG1ddFR.js:91 warmCurrentProviderAuthState` is structured as a nested `for` loop (abridged excerpt; variable declarations omitted):
  ```js
  // Outer loop: one pass per configured agent
  for (const agentId of listAgentIds(cfg)) {
    // External CLI discovery is evaluated again on every pass
    ensureAuthProfileStore(agentDir, {
      externalCli: externalCliDiscoveryForProviders({ cfg, providers: providerList })
    });
    // Inner loop: providers are checked one at a time, each awaited serially
    for (const provider of providers) {
      await hasAuthForModelProvider({ provider, cfg, workspaceDir, agentId, store, runtimeAuthLookup });
    }
  }
  ```
  Each `ensureAuthProfileStore` call is passed a fresh `externalCliDiscoveryForProviders` result, and that discovery on Linux can synchronously fan out to external CLI binaries (codex, gemini, claude, gh, etc.) to probe for cached auth. On a 2 vCPU box that combination is expensive enough to monopolize the event loop for 30+ s at a time (`eventLoopMax=36876.3ms`) and ~60 s end-to-end.

During that window the Discord channel's 15 s gateway-READY timer fires, forcing a reconnect; the first inbound DM after restart then waits for the reconnect + RESUME, so user-visible latency is roughly `pre-warm wall time + reconnect`.
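
To illustrate why a synchronous external probe would be so damaging, here is a small sketch comparing a blocking and a non-blocking child-process call. It assumes, as hypothesized above, that the discovery path spawns CLIs synchronously; this has not been confirmed against the source. A child Node process that exits after about 300 ms stands in for a real CLI, and `countTicksDuring`, `syncProbe`, and `asyncProbe` are hypothetical names.

```javascript
const { execFile, execFileSync } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Stand-in for an external CLI probe: a child Node process that exits after ~300 ms.
const PROBE_ARGS = ['-e', 'setTimeout(() => {}, 300)'];

// Counts how many 10 ms interval ticks get to run while `probe` is in flight.
async function countTicksDuring(probe) {
  let ticks = 0;
  const interval = setInterval(() => { ticks += 1; }, 10);
  try {
    await probe();
  } finally {
    clearInterval(interval);
  }
  return ticks;
}

// Blocking variant: the event loop cannot run anything until the child exits.
const syncProbe = async () => { execFileSync(process.execPath, PROBE_ARGS); };

// Non-blocking variant: timers and I/O callbacks keep running while the child works.
const asyncProbe = async () => { await execFileAsync(process.execPath, PROBE_ARGS); };
```

Running `countTicksDuring(syncProbe)` leaves the interval starved for the whole probe, while `countTicksDuring(asyncProbe)` lets it tick throughout. Channel timers such as the Discord READY wait sit in the same position as that interval.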

### Proposed fix shape (not a patch)

- Run `warmCurrentProviderAuthState` after `gateway ready` in the background instead of inside the post-attach awaited path, or at least yield (`setImmediate`/`await scheduler.yield()`) between providers so other handlers run.
- Cache `externalCliDiscoveryForProviders` results across the agent loop (today it appears to re-discover per `ensureAuthProfileStore` call).
- Run the per-provider `hasAuthForModelProvider` checks concurrently (`Promise.allSettled` style) instead of awaiting them serially, so that one slow probe (for example a `codex login status` style call) does not stall the rest.
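
The three ideas above can be sketched together as follows. This is only an illustration of the shape, not a patch: `warmAuthState` and `warmInBackground` are hypothetical names, and `discover` and `checkAuth` are injected stand-ins for `externalCliDiscoveryForProviders` and `hasAuthForModelProvider`, whose real signatures have not been checked against the source repository.

```javascript
const { setImmediate: yieldToEventLoop } = require('node:timers/promises');

// Hypothetical sketch: `discover` and `checkAuth` are caller-supplied stand-ins.
async function warmAuthState({ agentIds, providers, discover, checkAuth }) {
  // Discover external CLIs once and reuse the result for every agent.
  const externalCli = await discover(providers);
  const results = [];
  for (const agentId of agentIds) {
    // Start all provider checks together; the async wrapper turns a
    // synchronous throw into a rejection that allSettled can capture.
    const settled = await Promise.allSettled(
      providers.map(async (provider) => checkAuth({ agentId, provider, externalCli }))
    );
    settled.forEach((outcome, index) => {
      results.push({
        agentId,
        provider: providers[index],
        ok: outcome.status === 'fulfilled' && outcome.value === true,
      });
    });
    // Yield so pending timers and I/O callbacks can run before the next agent.
    await yieldToEventLoop();
  }
  return results;
}

// Fire-and-forget variant intended to be called after readiness is logged.
function warmInBackground(args, log = console.warn) {
  return warmAuthState(args).catch((err) => log('provider auth pre-warm failed', err));
}
```

Note that concurrency and yielding only help if the underlying checks are themselves asynchronous; any probe that blocks synchronously would still starve the loop and would need to move to an async child-process call or a worker.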

### Impact and severity

Affected: every restart of the gateway, every inbound message in the first ~60–90 s window after restart, across all channels (Discord/Telegram/Feishu/Slack all observed timing out their startup probes simultaneously).
Severity: Frustrating but recoverable (gateway eventually catches up).
Frequency: 100% reproducible on restart on this host.
Consequence: Any user message sent during the stall window is either lost or lands minutes late; the Discord WS is forced into a reconnect on every startup.

### Additional information

Last known good version: **2026.5.19** (we upgraded directly 5.19 → 5.22, skipping 5.20). First known bad version: **2026.5.22**. No workaround attempted yet beyond `systemctl restart`; planning to roll back to 5.19.

This is **not** a duplicate of #85975 / PR #85978 (Codex app-server `thread_bootstrap` native-thread rotation): that path requires the `openai-codex` provider and triggers per-turn, while this stall happens deterministically on every startup with `github-copilot/gpt-5.5` and is gone after the pre-warm finishes. The shared symptom is event-loop starvation, but the source files and trigger are different (`warmCurrentProviderAuthState` here, `rotateOversizedCodexAppServerStartupBinding` there).

---

_Report drafted by an AI agent (Hermes / claude-opus-4.7), reviewed by the human reporter before filing. Evidence above was collected by the agent from the affected host's logs and the installed npm package; the proposed fix shape is the agent's best read of the on-disk code and has not been validated against the source repository._

