Gateway intermittently stalls: WebSocket preauth handshakes time out late during model catalog/provider discovery
Summary
The OpenClaw Gateway intermittently becomes unresponsive for tens of seconds. During these windows, WebSocket preauth handshakes time out, but the timeout callbacks fire much later than the configured 10s timeout. Observed values range from ~18s to ~72s.
This strongly suggests the Gateway Node.js event loop is being blocked or starved by synchronous / CPU-heavy work inside the gateway process, rather than a pure networking or client-side timeout issue.
The strongest current suspect is expensive model catalog regeneration / provider discovery after model/config hot reloads, especially around:
- `ensureOpenClawModelsJson()`
- `planOpenClawModelsJson()`
- `resolveProvidersForModelsJsonWithDeps()`
- `resolveImplicitProviders()`
- `resolveRuntimePluginDiscoveryProviders()`
- dynamic import of `plugins/provider-discovery.runtime.js`
- `resolvePluginDiscoveryProvidersRuntime()` / `resolvePluginProviders()`
This issue captures the findings so far; I can add deeper traces if needed.
Environment
- OS: Ubuntu 22.04.5 LTS, arm64
- Node.js: v24.14.1
- Gateway bound on LAN, local port `18789`
- Gateway process remains alive during incidents
- No obvious FD leak observed during checks
- Host not thermally throttled during later checks:
  - 4 cores
  - temperature around ~41°C
  - `vcgencmd get_throttled` = `0x0`
Symptoms
Observed in Gateway logs:
- WebSocket connections close before full connect/preauth.
- Important class: `cause=handshake-timeout`.
- Configured preauth handshake timeout is 10s.
- Actual logged `handshakeMs` / `durationMs` values were much larger. Examples:
  - `handshakeMs=47133`
  - `handshakeMs=54995`
  - `handshakeMs=49755`
  - `handshakeMs=71389`
  - `handshakeMs=18201`
  - `handshakeMs=34302`
Because the timer callback itself fires late, this looks like event-loop starvation/blocking.
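For intuition, here is a standalone illustration (not Gateway code) of why late-firing timers point at the event loop: a Node timer callback can only run once the loop is free, so any long synchronous task that starts before the deadline pushes the callback past it.

```ts
// Standalone illustration (not Gateway code): a timer callback can only run
// once the event loop is free, so synchronous work pushes it past its deadline.
const scheduledAt = Date.now();
setTimeout(() => {
  const lateMs = Date.now() - scheduledAt - 10_000;
  console.log(`10s timer fired ~${lateMs}ms late`); // large when the loop was blocked
}, 10_000);

// Simulate ~30s of synchronous, CPU-bound work on the main thread.
const blockUntil = Date.now() + 30_000;
while (Date.now() < blockUntil) {
  // Busy loop: timers, sockets, and WebSocket handshakes cannot progress here.
}
```

In this sketch the 10s timer fires roughly 20s late, which is the same shape as the `handshakeMs` values above.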
Additional symptoms seen during affected windows:
- Some Gateway RPCs were occasionally slow:
  - `config.get` ~2.1–2.4s
  - `channels.status` ~2.0s
  - `config.schema.lookup` ~3.2s
  - `agent.wait` reached its expected long wait timeout
- Telegram polling stalls were also seen near some incident windows.
- `openclaw status --json --timeout 1000` had ~14s wall time despite a reported gateway connect latency around ~160ms.
- `openclaw logs --limit 5 --timeout 5000` had ~12s wall time, while the underlying `logs.tail` RPC later measured fast after warmup.
Findings
1. Not likely to be a simple network / socket issue
The Gateway was alive and listening on `0.0.0.0:18789`.
HTTP/dashboard probes to `127.0.0.1:18789` were healthy when idle:
- first probe around ~285ms
- later probes around ~30ms
No obvious FD leak was observed:
- FD count around mid/high 30s
- threads around 11
2. Preauth timer firing late is the key evidence
Code inspection found:
- `DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS = 10000`
- the server uses `getPreauthHandshakeTimeoutMsFromEnv()` for the preauth timer
But logs showed timeout callbacks firing after 18–72s. That strongly indicates the event loop was unable to run the timer callback on time.
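One way to confirm this from inside the Gateway would be to log how late the preauth timer itself fires. A minimal sketch follows; the wrapper and log wording are hypothetical, while `getPreauthHandshakeTimeoutMsFromEnv()` is the existing helper mentioned above.

```ts
// Hypothetical instrumentation around the existing preauth timeout: record when
// the timer was scheduled and how far past its deadline the callback actually ran.
declare function getPreauthHandshakeTimeoutMsFromEnv(): number; // existing helper

function schedulePreauthTimeout(onTimeout: () => void): NodeJS.Timeout {
  const timeoutMs = getPreauthHandshakeTimeoutMsFromEnv(); // e.g. 10000
  const scheduledAt = Date.now();
  return setTimeout(() => {
    const lateMs = Date.now() - scheduledAt - timeoutMs;
    if (lateMs > 1_000) {
      // A large value here means the event loop could not run the callback on time.
      console.warn(`preauth handshake timeout fired ${lateMs}ms late`);
    }
    onTimeout();
  }, timeoutMs);
}
```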
3. Session/transcript sync reads are a smell, but probably not the sole root cause
Several gateway-adjacent paths use synchronous file reads/parsing:
- session store reads via `readSessionStoreReadOnly`
- transcript reads via `readSessionMessages()`
- some status/history/session preview paths
This is concerning, especially for large transcripts/checkpoints. However, microbenchmarks on the largest local artifacts showed individual parse costs mostly in the ~100–500ms range, not enough alone to explain 70s stalls.
So this remains a contributor/smell, but not the strongest single root cause.
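If these paths are revisited, the obvious incremental step is to make the file I/O asynchronous. A sketch assuming a JSONL-style transcript (the real `readSessionMessages()` format and signature may differ):

```ts
import { readFile } from 'node:fs/promises';

// Hypothetical async variant of a transcript read like readSessionMessages():
// the file I/O no longer blocks the event loop. JSON.parse itself still runs on
// the loop, so very large transcripts may additionally need chunking or a worker.
async function readSessionMessagesAsync(transcriptPath: string): Promise<unknown[]> {
  const raw = await readFile(transcriptPath, 'utf8');
  // Assumed JSONL layout (one message object per line); adjust to the real store format.
  return raw
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```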
4. Late handshake windows correlate with state churn
The worst late-handshake windows clustered around:
- config/model/default alias hot reloads
- provider/model catalog related changes
- pairing/device-state churn
- stuck session / embedded-run lifecycle events
This points toward bursts of gateway-side state recomputation or plugin/runtime work.
5. Stronger suspect: model catalog / provider discovery cold path
Local probes outside the gateway showed that `ensureOpenClawModelsJson()` can be extremely expensive:
- Importing `model-catalog-*.js` in a fresh Node process took ~6.5s.
- `loadModelCatalog()` / `ensureOpenClawModelsJson()` without an outer provider-discovery timeout did not complete within test timeouts in some probes.
- With `OPENCLAW_LIVE_GATEWAY=1`, `OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS=3000`, and `OPENCLAW_LIVE_GATEWAY_PROVIDERS=codex`, it completed in ~20.7s and logged that Codex provider catalog discovery timed out after 3000ms.
Filtered single-provider probes still showed high wall times:
- `anthropic-vertex`: ~13.9s
- `deepseek`: ~38.6s
- `codex` with a 3s provider timeout: ~20.7s
This suggests the problem is not only one slow live provider catalog call. There appears to be significant plugin/runtime/provider setup overhead.
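For reference when evaluating the "bounded timeout" mitigation proposed below, bounding such a discovery call generically would look roughly like this. This is not an existing OpenClaw helper; note that racing a timeout only stops the caller from waiting unless the callee also honors the abort signal.

```ts
// Generic sketch: bound a promise-returning discovery call with a timeout.
async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  label: string,
): Promise<T> {
  const controller = new AbortController();
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`${label} timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
  const workPromise = work(controller.signal);
  workPromise.catch(() => {}); // avoid an unhandled rejection if the timeout wins
  try {
    return await Promise.race([workPromise, timeout]);
  } finally {
    clearTimeout(timer); // no stray late rejection when the work wins the race
  }
}

// Illustrative usage with a stand-in discovery function (not a real OpenClaw API):
// const providers = await withTimeout(
//   (signal) => discoverProviderCatalog('codex', { signal }),
//   3_000,
//   'codex catalog discovery',
// );
```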
6. Provider filtering appears insufficient for the full path
A direct call to `resolvePluginDiscoveryProvidersRuntime({ onlyPluginIds: ['deepseek'] })` behaved mostly as expected:
- the entries-only run returned only `deepseek` in ~81ms
- a normal only-plugin run returned only `deepseek` in ~7.6s
But when running through `ensureOpenClawModelsJson()` with provider filtering, profiling still showed broad provider plugin loading, including:
- `anthropic`
- `byteplus`
- `deepseek`
- `moonshot`
- `openai`
- `tencent`
- `volcengine`
- `xai`
Even a minimized in-memory `cfg.models.providers = { deepseek: ... }` still showed broad provider plugin load/profile activity.
A later source-path refinement makes provider-specific policy helpers less likely to be the cold-load trigger: `normalizeProviderConfigPolicy()`, `applyProviderNativeStreamingUsagePolicy()`, and `resolveProviderConfigApiKeyPolicy()` call their plugin helpers with `allowRuntimePluginLoad: false`.
So the stronger suspect is the runtime discovery stage before/around normalization:
`resolveProvidersForModelsJsonWithDeps()` -> `resolveImplicitProviders()` -> `resolveRuntimePluginDiscoveryProviders()` -> dynamic import of `plugins/provider-discovery.runtime.js` -> `resolvePluginDiscoveryProvidersRuntime()` -> `resolvePluginProviders()`
In particular, `resolvePluginDiscoveryProvidersRuntime()` has a selective full-plugin fallback when `onlyPluginIds` is undefined. If the provider filter is not passed through correctly from the config/hot-reload context, or if source snapshot projection reintroduces providers, this can cause broad provider loading.
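If that is confirmed, the fix direction would be to always derive a narrow `onlyPluginIds` from the static provider config before allowing the full-plugin fallback. A hypothetical sketch; only `onlyPluginIds` and `cfg.models.providers` come from the findings above, the helper and types are illustrative:

```ts
// Hypothetical guard: derive a narrow plugin filter from the static provider
// config before ever falling back to full plugin discovery.
interface ModelsConfig {
  models?: { providers?: Record<string, unknown> };
}

function deriveOnlyPluginIds(cfg: ModelsConfig, explicit?: string[]): string[] | undefined {
  if (explicit && explicit.length > 0) return explicit;
  const configured = Object.keys(cfg.models?.providers ?? {});
  // Fall back to "load every plugin" only when the config names no providers at all.
  return configured.length > 0 ? configured : undefined;
}

// e.g. resolvePluginDiscoveryProvidersRuntime({ onlyPluginIds: deriveOnlyPluginIds(cfg) })
```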
7. Warm cache is much faster, cold path is the problem
Within one process:
- first full provider discovery call: ~17.5s
- second call: ~1.3s
- third call: ~0.7s
So caching helps, but config/model hot reloads or cache invalidation can re-enter the expensive cold path.
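One way to keep that re-entry off the hot path is to serve the last good catalog immediately and refresh it in the background with a single shared in-flight promise. A minimal stale-while-revalidate sketch; the `regenerate` callback stands in for the real regeneration code, and this is not the existing cache implementation:

```ts
// Minimal stale-while-revalidate sketch. A real implementation would gate the
// refresh on cache invalidation (e.g. after a hot reload) rather than refreshing
// on every call.
type Catalog = unknown;

let cached: Catalog | undefined;
let inflight: Promise<Catalog> | undefined;

async function getCatalog(regenerate: () => Promise<Catalog>): Promise<Catalog> {
  if (!inflight) {
    // One shared refresh: concurrent callers reuse the same promise.
    inflight = regenerate()
      .then((fresh) => {
        cached = fresh;
        return fresh;
      })
      .finally(() => {
        inflight = undefined;
      });
    inflight.catch((err) => {
      console.warn('catalog refresh failed; continuing to serve the stale catalog', err);
    });
  }
  // Serve the last good catalog immediately; only first-ever callers wait for the refresh.
  return cached ?? inflight;
}
```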
Current diagnosis
The Gateway can become temporarily unresponsive because model catalog regeneration / provider discovery performs expensive cold plugin/runtime work in or adjacent to latency-sensitive Gateway paths.
This can delay timers, WebSocket handshakes, CLI/RPC responses, Telegram polling, and possibly status/log commands.
The issue is likely not one single slow transcript read, but a combination of:
- expensive cold provider/plugin runtime loading,
- insufficient provider filtering in the full model-catalog regeneration path,
- potentially unbounded or poorly bounded live provider discovery,
- cache invalidation after model/config hot reloads,
- work running on the Gateway event loop instead of being deferred/offloaded.
Proposed fixes / mitigations
Short-term mitigations
- Add a bounded default timeout for live provider discovery in Gateway contexts.
  - Example: default `OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS` to a small value when running inside the gateway.
  - Avoid unbounded provider catalog resolution during hot reload.
- Avoid running the full `ensureOpenClawModelsJson()` synchronously after config/model hot reload.
  - Defer regeneration.
  - Serve the stale cached catalog while the refresh happens in the background.
  - Emit a warning if the refresh fails or times out.
- Ensure provider discovery filters are passed through the full path.
  - In particular, check what `onlyPluginIds` is passed into `resolveRuntimePluginDiscoveryProviders()` during `ensureOpenClawModelsJson()` after config/model hot reloads.
  - Avoid broad `resolvePluginProviders()` calls when a narrowed provider set is known.
- Add method-level timing logs around:
  - `applyHotReload()`
  - `resetModelCatalogCache()`
  - `ensureOpenClawModelsJson()`
  - `planOpenClawModelsJson()`
  - `resolveProvidersForModelsJsonWithDeps()`
  - `resolveImplicitProviders()`
  - `resolveRuntimePluginDiscoveryProviders()`
  - `resolvePluginDiscoveryProvidersRuntime()`
  - `resolvePluginProviders()`
  - `readSessionMessages()`
- Add runtime event-loop-delay instrumentation in the Gateway (a sketch follows this list).
  - For example, log when event-loop delay exceeds 1s / 5s / 10s, including the current high-level operation if known.
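For the event-loop-delay instrumentation item above, Node's built-in `perf_hooks.monitorEventLoopDelay` is enough for a first pass; the thresholds and log wording below are illustrative.

```ts
import { monitorEventLoopDelay } from 'node:perf_hooks';

// Sample the event-loop delay histogram once a second and warn when the
// observed max crosses the thresholds suggested above.
const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20ms
histogram.enable();

setInterval(() => {
  const maxDelayMs = histogram.max / 1e6; // histogram values are in nanoseconds
  if (maxDelayMs > 10_000) {
    console.error(`event loop stalled for ~${Math.round(maxDelayMs)}ms`);
  } else if (maxDelayMs > 1_000) {
    console.warn(`event loop delayed by ~${Math.round(maxDelayMs)}ms`);
  }
  histogram.reset(); // start a fresh window for the next interval
}, 1_000).unref();
```

The warning itself can only print once the loop is free again, but the histogram max still records how long the stall was.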
Longer-term fixes
- Move expensive model catalog / provider discovery work out of the Gateway request/event-loop path (a worker-thread sketch follows this list).
  - Use a worker thread or child process.
  - Keep the Gateway responsive while the catalog refresh happens.
- Make provider discovery fully cache-aware and cancellation-aware.
  - Timeouts should abort work cleanly, not merely stop waiting while leaving expensive tasks/resources alive.
- Separate static provider config normalization from live provider discovery.
  - Normalization should not require broad plugin loading if the static config is already known.
- Improve CLI command decomposition.
  - Commands like `openclaw status` / `openclaw logs` should avoid unnecessary status/bootstrap/provider/model work when the requested RPC itself is cheap.
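For the worker-thread direction above, a minimal `node:worker_threads` sketch; file names, message shapes, and the placeholder work are illustrative, and the real refresh would run `ensureOpenClawModelsJson()` or its equivalent inside the worker.

```ts
// Main thread (illustrative): run the catalog refresh in a worker so the
// Gateway event loop stays free for timers and WebSocket handshakes.
import { Worker } from 'node:worker_threads';

export function refreshCatalogInWorker(): Promise<unknown> {
  return new Promise((resolve, reject) => {
    // Hypothetical worker file; it would invoke the real regeneration code.
    const worker = new Worker(new URL('./catalog-refresh-worker.js', import.meta.url));
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`catalog worker exited with code ${code}`));
    });
  });
}
```

```ts
// catalog-refresh-worker.js (illustrative): the expensive cold path runs here,
// off the main thread, and the result is posted back to the Gateway.
import { parentPort } from 'node:worker_threads';

const catalog = { providers: {} }; // placeholder for the real regeneration result
parentPort?.postMessage(catalog);
```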
Reproduction / diagnostic hints
Useful probes:
- Compare `logs.tail` RPC time vs the full `openclaw logs` command wall time.
- Trigger a model/default-alias config hot reload, then watch for late WS handshake timers.
- Run `ensureOpenClawModelsJson()` in a fresh Node process with plugin profiling enabled (a probe sketch follows this list).
- Test with and without a provider discovery timeout:
  - `OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS=3000`
  - `OPENCLAW_LIVE_GATEWAY_PROVIDERS=codex`
  - `OPENCLAW_PLUGIN_LOAD_PROFILE=1`
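A minimal probe script for the fresh-process run above; the import specifier is an assumption, so point it at whatever module actually exports `ensureOpenClawModelsJson()`.

```ts
// probe-model-catalog.ts (illustrative): time the cold catalog path in a fresh process.
process.env.OPENCLAW_PLUGIN_LOAD_PROFILE = '1';
process.env.OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS ??= '3000';

const importStart = performance.now();
const catalogModule = await import('./dist/model-catalog.js'); // hypothetical path
console.log(`import took ${(performance.now() - importStart).toFixed(0)}ms`);

const callStart = performance.now();
await catalogModule.ensureOpenClawModelsJson();
console.log(`ensureOpenClawModelsJson took ${(performance.now() - callStart).toFixed(0)}ms`);
```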
Key expected signal:
- A 10s WebSocket preauth timer firing after much more than 10s is the clearest symptom of event-loop delay.