Bug type
Regression (worked before, now fails)
Beta release blocker
No
Summary
After upgrading from 2026.5.19 to 2026.5.22, gateway startup blocks the Node event loop for ~60 seconds inside warmCurrentProviderAuthState, causing channel handshakes (Discord READY, Feishu bot info, Telegram deleteWebhook) to time out, and leaving inbound messages stalled for ~1 minute on every restart.
Steps to reproduce
- On a 2 vCPU Linux host (Azure B2als_v2), install openclaw@2026.5.22 and start the gateway as a systemd user service.
- Confirm at least one configured agent with multiple model providers (in our case: github-copilot, openai, anthropic, openrouter, plus the default catalog providers).
- Restart the gateway (systemctl --user restart openclaw-gateway) and watch /tmp/openclaw/openclaw-YYYY-MM-DD.log and journalctl _PID=<pid>.
- From the moment gateway ready is logged, send a Discord DM (or any inbound message) to the bot within the first ~90 seconds.
Expected behavior
On 2026.5.19 the same restart on the same host completed channel startup in ~5–10 s, and inbound messages received within the first minute were dispatched in <3 s.
Actual behavior
Two consecutive restarts on 2026.5.22 (PIDs 712063 and 721897, ~30 minutes apart, same config) both reproduced:
- provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms (first restart)
- provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms (second restart)
- Liveness warnings during the same window: event_loop_delay,cpu interval=37s eventLoopDelayP99Ms=14445.2 eventLoopDelayMaxMs=21776.8 eventLoopUtilization=1 cpuCoreRatio=1.041
- Channel-side fallout, e.g.:
  - [fetch-timeout] fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
  - [discord] gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)
  - [feishu] bot info probe timed out after 30000ms; continuing startup
  - [telegram] deleteWebhook failed: Network request failed.
- End-to-end Discord inbound latency: first user DM after restart took ~60 s before session.started showed up in the trajectory; the model call itself (github-copilot/gpt-5.5) took only ~3.9 s. The ~60 s delay is entirely on the inbound/gateway side, dominated by the pre-warm stall + the Discord WS reconnect it triggers.
External network from this host to discord.com/api, gateway.discord.gg, and api.telegram.org is healthy (curl latency 40–680 ms with 200/302), so this is not a transit issue.
OpenClaw version
2026.5.22 (a374c3a)
Operating system
Ubuntu 24.04.4 LTS (Azure VM, Standard_B2als_v2, 2 vCPU, 4 GB RAM, japaneast)
Install method
npm global (npm i -g openclaw@2026.5.22)
Model
github-copilot/gpt-5.5
Provider / routing chain
openclaw -> github-copilot
Additional provider/model setup details
The blocked work is provider-auth pre-warm, not the model call itself, so the model/provider path is largely incidental. The config has multiple providers enabled across the agent catalog (github-copilot, openai, anthropic, openrouter, etc.), which appears to amplify the cost (see Root cause below).
Logs, screenshots, and evidence
Two independent restarts of the same gateway both logged a single-line marker showing the pre-warm wall time and the worst single event-loop block during it:
2026-05-24T08:46:04+00:00 [gateway] provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms
2026-05-24T09:17:43+00:00 [gateway] provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms
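To compare restarts, the two markers above can be parsed mechanically. The following is a minimal sketch, not part of OpenClaw: the regex is an assumption derived only from the two quoted lines, and parsePrewarmMarker and worstBlockShare are hypothetical names introduced here.

```javascript
// Assumed marker shape, inferred from the two log lines quoted in this report.
const PREWARM_RE = /provider auth state pre-warmed in (\d+)ms eventLoopMax=([\d.]+)ms/;

// Hypothetical helper: returns null when the line is not a pre-warm marker.
function parsePrewarmMarker(line) {
  const match = PREWARM_RE.exec(line);
  if (!match) return null;
  const wallMs = Number(match[1]);
  const eventLoopMaxMs = Number(match[2]);
  return {
    wallMs,
    eventLoopMaxMs,
    // Share of the pre-warm wall time spent in the single worst block.
    worstBlockShare: eventLoopMaxMs / wallMs,
  };
}

const lines = [
  "2026-05-24T08:46:04+00:00 [gateway] provider auth state pre-warmed in 58655ms eventLoopMax=36540.8ms",
  "2026-05-24T09:17:43+00:00 [gateway] provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms",
];

for (const line of lines) {
  const marker = parsePrewarmMarker(line);
  const sharePct = (marker.worstBlockShare * 100).toFixed(1);
  console.log(`${marker.wallMs} ms wall, worst single block ${marker.eventLoopMaxMs} ms (${sharePct}% of wall time)`);
}
```

On both restarts the single worst block accounts for more than half of the pre-warm wall time, which is consistent with one long synchronous stretch rather than many short ones.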
Liveness warning during the same window:
2026-05-24T08:40:17+00:00 [diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=37s eventLoopDelayP99Ms=14445.2 eventLoopDelayMaxMs=14445.2 eventLoopUtilization=1 cpuCoreRatio=1.041 active=2 waiting=0 queued=0 recentPhases=...,sidecars.session-locks:50912ms work=[active=agent:main:telegram:direct:...(processing/model_call,q=1,age=14s last=model_call:started)|agent:main:discord:direct:...(processing/embedded_run,q=1,age=6s last=embedded_run:started)]
Discord-side fallout (second restart):
09:16:44.618 INFO discord client initialized; awaiting gateway readiness
09:17:21.482 ERROR discord: gateway READY wait timed out after 15000ms; reconnecting with backoff (attempt 1)
09:17:21.496 WARN fetch timeout after 10000ms (elapsed 45211ms) timer delayed 35211ms, likely event-loop starvation operation=fetchWithTimeout url=https://discord.com/api/v10/users/@me
09:17:43.612 INFO provider auth state pre-warmed in 67840ms eventLoopMax=36876.3ms
09:18:29.214 WARN liveness warning: ... work=[active=agent:main:discord:direct:763966741445083166(processing/embedded_run, age=28s)]
09:18:49.853 (trajectory) session.started <- first inbound DM finally picked up
09:18:53.231 (trajectory) model.completed <- ~3.4s model call
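The gaps in the timeline above can be checked with simple arithmetic on the timestamps. This is a small sketch using only the values quoted in the log excerpt; toMs and the event labels are hypothetical helpers for illustration.

```javascript
// Convert an "HH:MM:SS.mmm" timestamp to milliseconds since midnight.
function toMs(stamp) {
  const [h, m, s] = stamp.split(":");
  // Math.round guards against floating-point noise in the seconds field.
  return (Number(h) * 3600 + Number(m) * 60) * 1000 + Math.round(Number(s) * 1000);
}

// Timestamps copied from the second-restart excerpt above.
const events = [
  ["discord client initialized", "09:16:44.618"],
  ["gateway READY wait timed out", "09:17:21.482"],
  ["provider auth pre-warm finished", "09:17:43.612"],
  ["session.started", "09:18:49.853"],
  ["model.completed", "09:18:53.231"],
];

const t0 = toMs(events[0][1]);
for (const [label, stamp] of events) {
  const sinceStartSec = ((toMs(stamp) - t0) / 1000).toFixed(3);
  console.log(`+${sinceStartSec} s  ${label}`);
}

// Time between the session starting and the model call completing.
const modelCallMs = toMs("09:18:53.231") - toMs("09:18:49.853");
console.log(`session.started -> model.completed: ${modelCallMs} ms`);
```

The last delta is 3378 ms, matching the ~3.4 s annotation in the excerpt, while more than two minutes pass between Discord client initialization and session.started.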
Root cause (best-effort, from reading the installed npm package on disk)
In /usr/lib/node_modules/openclaw/dist/:
server-startup-post-attach-ezNyN6B3.js calls warmCurrentProviderAuthState(cfg, { isCancelled }) once per gateway post-attach pass and awaits it; the wall time + worst per-tick stall are then logged via formatProviderAuthWarmMetrics.
model-provider-auth-DAG1ddFR.js:91 warmCurrentProviderAuthState is structured as a double for loop:
    for (const agentId of listAgentIds(cfg)) {
      ensureAuthProfileStore(agentDir, {
        externalCli: externalCliDiscoveryForProviders({ cfg, providers: providerList })
      });
      for (const provider of providers) {
        await hasAuthForModelProvider({ provider, cfg, workspaceDir, agentId, store, runtimeAuthLookup });
      }
    }
Each ensureAuthProfileStore call invokes externalCliDiscoveryForProviders, which on Linux can synchronously fan out to external CLI binaries (codex, gemini, claude, gh, etc.) to probe for cached auth. On a 2 vCPU box that combination is heavy enough to monopolize the event loop for 30+ s at a time (eventLoopMax=36876.3ms) and ~60 s end-to-end.
During that window the Discord channel's 15 s gateway-READY timer fires, forcing a reconnect; the first inbound DM after restart then waits for the reconnect + RESUME, so user-visible latency is roughly pre-warm wall time + reconnect.
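The mechanism described above, where synchronous work delays unrelated timers, can be reproduced in isolation. The sketch below is not OpenClaw code: blockFor is a busy-wait stand-in for whatever synchronous probing happens during pre-warm, and measureTimerDelay is a hypothetical helper. It shows the same pattern as the "fetch timeout after 10000ms (elapsed 45211ms)" log line, at a much smaller scale.

```javascript
// Busy-wait stand-in for synchronous work (for example, a blocking CLI probe).
// This is an assumption for illustration; the real blocking call is not confirmed.
function blockFor(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) {
    // Spin: nothing else on the event loop can run during this loop.
  }
}

// Schedule a timer for `timeoutMs`, then block the thread for `blockMs`.
// Resolves with the actual elapsed time and how late the timer fired.
function measureTimerDelay(timeoutMs, blockMs) {
  return new Promise((resolve) => {
    const start = performance.now();
    setTimeout(() => {
      const elapsedMs = performance.now() - start;
      resolve({ elapsedMs, delayedMs: elapsedMs - timeoutMs });
    }, timeoutMs);
    blockFor(blockMs);
  });
}

measureTimerDelay(10, 200).then(({ elapsedMs, delayedMs }) => {
  console.log(
    `timer set for 10 ms fired after ${elapsedMs.toFixed(0)} ms ` +
      `(delayed ${delayedMs.toFixed(0)} ms)`
  );
});
```

The timer cannot fire until the synchronous block ends, regardless of its configured timeout. The same applies to the Discord 15 s READY timer and the 10 s fetch timeout: they fire late, and only once the loop is free again.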
Proposed fix shape (not a patch)
- Run warmCurrentProviderAuthState after gateway ready in the background instead of inside the post-attach awaited path, or at least yield (setImmediate/await scheduler.yield()) between providers so other handlers run.
- Cache externalCliDiscoveryForProviders results across the agent loop (today it appears to re-discover per ensureAuthProfileStore call).
- Make per-provider hasAuthForModelProvider work Promise.allSettled style rather than serial await, so a slow codex login status style probe does not stall the rest.
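The three ideas above could be combined roughly as follows. This is a hedged sketch, not a patch: warmAuthState, discover, and hasAuth are hypothetical names, with discover and hasAuth injected as stand-ins for externalCliDiscoveryForProviders and hasAuthForModelProvider, whose real signatures have not been checked against the source repository.

```javascript
// Yield to the event loop so pending timers and I/O callbacks can run.
const yieldToEventLoop = () => new Promise((resolve) => setImmediate(resolve));

// Hypothetical warm-up loop; `discover` and `hasAuth` are injected stand-ins.
async function warmAuthState({ agentIds, providers, discover, hasAuth }) {
  // Discover once and reuse the result for every agent.
  const discovery = await discover(providers);
  const results = new Map();
  for (const agentId of agentIds) {
    // Probe providers concurrently; a slow or failing probe does not hold up the rest.
    const settled = await Promise.allSettled(
      providers.map((provider) => hasAuth({ agentId, provider, discovery }))
    );
    results.set(
      agentId,
      settled.map((outcome, i) => ({
        provider: providers[i],
        ok: outcome.status === "fulfilled" && outcome.value === true,
      }))
    );
    // Let channel handshakes and timers run between agents.
    await yieldToEventLoop();
  }
  return results;
}

// Demo with fake probes: one provider fails, discovery is counted.
async function demo() {
  let discoverCalls = 0;
  const results = await warmAuthState({
    agentIds: ["main", "helper"],
    providers: ["github-copilot", "openai", "anthropic"],
    discover: async () => {
      discoverCalls += 1;
      return { cli: [] };
    },
    hasAuth: async ({ provider }) => {
      if (provider === "openai") throw new Error("probe failed");
      return true;
    },
  });
  return { discoverCalls, results };
}

demo().then(({ discoverCalls, results }) => {
  console.log(`discovery ran ${discoverCalls} time(s)`);
  console.log(results.get("main"));
});
```

Note that Promise.allSettled and setImmediate only help if the probes themselves are asynchronous. If a probe blocks synchronously (as the root-cause reading suggests may happen during CLI discovery), it would still need to be moved to an asynchronous child process or a worker before the event loop is actually freed.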
Impact and severity
Affected: every restart of the gateway, every inbound message in the first ~60–90 s window after restart, across all channels (Discord/Telegram/Feishu/Slack all observed timing out their startup probes simultaneously).
Severity: Frustrating but recoverable (gateway eventually catches up).
Frequency: 100% reproducible on restart on this host.
Consequence: Any user message sent during the stall window is either lost or delivered minutes late; the Discord WS is forced into a reconnect on every startup.
Additional information
Last known good version: 2026.5.19 (we upgraded directly 5.19 → 5.22, skipping 5.20). First known bad version: 2026.5.22. No workaround attempted yet beyond systemctl restart; planning to roll back to 5.19.
This is not a duplicate of #85975 / PR #85978 (Codex app-server thread_bootstrap native-thread rotation): that path requires the openai-codex provider and triggers per-turn, while this stall happens deterministically on every startup with github-copilot/gpt-5.5 and is gone after the pre-warm finishes. The shared symptom is event-loop starvation, but the source files and trigger are different (warmCurrentProviderAuthState here, rotateOversizedCodexAppServerStartupBinding there).
Report drafted by an AI agent (Hermes / claude-opus-4.7), reviewed by the human reporter before filing. Evidence above was collected by the agent from the affected host's logs and the installed npm package; the proposed fix shape is the agent's best read of the on-disk code and has not been validated against the source repository.