Gateway intermittently stalls: WebSocket preauth handshakes time out late during model catalog/provider discovery #74135

@DerFlash

Description

Summary

The OpenClaw Gateway intermittently becomes unresponsive for tens of seconds. During these windows, WebSocket preauth handshakes time out, but the timeout callbacks fire much later than the configured 10s timeout. Observed values range from ~18s to ~72s.

This strongly suggests the Gateway Node.js event loop is being blocked or starved by synchronous / CPU-heavy work inside the gateway process, rather than a pure networking or client-side timeout issue.

The strongest current suspect is expensive model catalog regeneration / provider discovery after model/config hot reloads, especially around:

  • ensureOpenClawModelsJson()
  • planOpenClawModelsJson()
  • resolveProvidersForModelsJsonWithDeps()
  • resolveImplicitProviders()
  • resolveRuntimePluginDiscoveryProviders()
  • dynamic import of plugins/provider-discovery.runtime.js
  • resolvePluginDiscoveryProvidersRuntime() / resolvePluginProviders()

This issue captures initial findings; I can add deeper traces if needed.

Environment

  • OS: Ubuntu 22.04.5 LTS, arm64
  • Node.js: v24.14.1
  • Gateway bound on LAN, local port 18789
  • Gateway process remains alive during incidents
  • No obvious FD leak observed during checks
  • Host not thermally throttled during later checks:
    • 4 cores
    • temperature around ~41°C
    • vcgencmd get_throttled = 0x0

Symptoms

Observed in Gateway logs:

  • WebSocket connections close before full connect/preauth.
  • The dominant failure class is cause=handshake-timeout.
  • Configured preauth handshake timeout is 10s.
  • Actual logged handshakeMs / durationMs values were much larger:

Examples:

  • handshakeMs=47133
  • handshakeMs=54995
  • handshakeMs=49755
  • handshakeMs=71389
  • handshakeMs=18201
  • handshakeMs=34302

Because the timer callback itself fires late, this looks like event-loop starvation/blocking.

Additional symptoms seen during affected windows:

  • Some Gateway RPCs occasionally slow:
    • config.get ~2.1–2.4s
    • channels.status ~2.0s
    • config.schema.lookup ~3.2s
    • agent.wait reached expected long wait timeout
  • Telegram polling stalls were also seen near some incident windows.
  • openclaw status --json --timeout 1000 had ~14s wall time despite reported gateway connect latency around ~160ms.
  • openclaw logs --limit 5 --timeout 5000 had ~12s wall time, while the underlying logs.tail RPC later measured fast after warmup.

Findings

1. Not likely to be a simple network / socket issue

The Gateway was alive and listening on 0.0.0.0:18789.

HTTP/dashboard probes to 127.0.0.1:18789 were healthy when idle:

  • first probe around ~285ms
  • later probes around ~30ms

No obvious FD leak was observed:

  • FD count around mid/high 30s
  • threads around 11

2. Preauth timer firing late is the key evidence

Code inspection found:

  • DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS = 10000
  • server uses getPreauthHandshakeTimeoutMsFromEnv() for the preauth timer

But logs showed timeout callbacks firing after 18–72s. That strongly indicates the event loop was unable to run the timer callback on time.
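The mechanism can be demonstrated in isolation. This is a minimal sketch (not OpenClaw code): a 100ms timer is scheduled, then ~500ms of synchronous work (standing in for cold provider discovery) blocks the loop, so the timer fires ~400ms late — the same mechanism that lets a 10s preauth timer fire at 18–72s.

```javascript
// A 100ms timer cannot fire while synchronous work holds the event loop.
const scheduledAt = Date.now();
const firedLate = new Promise((resolve) => {
  setTimeout(() => resolve(Date.now() - scheduledAt - 100), 100);
});

// Synchronous CPU-bound work: no timers, I/O, or handshakes run meanwhile.
const blockUntil = Date.now() + 500;
while (Date.now() < blockUntil) { /* busy-wait */ }

firedLate.then((lateMs) => console.log(`timer fired ~${lateMs}ms late`));
```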

3. Session/transcript sync reads are a smell, but probably not the sole root cause

Several gateway-adjacent paths use synchronous file reads/parsing:

  • session store reads via readSessionStoreReadOnly
  • transcript reads via readSessionMessages()
  • some status/history/session preview paths

This is concerning, especially for large transcripts/checkpoints. However, microbenchmarks on the largest local artifacts showed individual parse costs mostly in the ~100–500ms range, not enough alone to explain 70s stalls.

So this remains a contributor/smell, but not the strongest single root cause.

4. Late handshake windows correlate with state churn

The worst late-handshake windows clustered around:

  • config/model/default alias hot reloads
  • provider/model catalog related changes
  • pairing/device-state churn
  • stuck session / embedded-run lifecycle events

This points toward bursts of gateway-side state recomputation or plugin/runtime work.

5. Stronger suspect: model catalog / provider discovery cold path

Local probes outside the gateway showed ensureOpenClawModelsJson() can be extremely expensive:

  • importing model-catalog-*.js in a fresh Node process took ~6.5s

  • loadModelCatalog() / ensureOpenClawModelsJson() without an outer provider-discovery timeout did not complete within test timeouts in some probes

  • with:

    • OPENCLAW_LIVE_GATEWAY=1
    • OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS=3000
    • OPENCLAW_LIVE_GATEWAY_PROVIDERS=codex

    it completed in ~20.7s and logged that Codex provider catalog discovery timed out after 3000ms.

Filtered single-provider probes still showed high wall times:

  • anthropic-vertex: ~13.9s
  • deepseek: ~38.6s
  • codex with 3s provider timeout: ~20.7s

This suggests the problem is not only one slow live provider catalog call. There appears to be significant plugin/runtime/provider setup overhead.
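A plausible (hedged) sketch of what a per-provider timeout like OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS does — I have not confirmed the real implementation — is racing discovery against a timer:

```javascript
// Race a discovery promise against a timeout; resolve with a fallback if
// the timeout wins. Names and shape are illustrative, not OpenClaw code.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for a slow provider catalog call; the timer is unref'd so the
// demo process can exit once the race settles.
const slowDiscovery = new Promise((resolve) => {
  setTimeout(() => resolve(['deepseek']), 5_000).unref();
});

withTimeout(slowDiscovery, 50, []).then((providers) => {
  console.log(`discovery returned ${providers.length} provider(s)`);
});
```

Note this pattern only stops *waiting*: the underlying work keeps running on the loop, which would be consistent with the codex probe still taking ~20.7s of wall time despite its 3s provider timeout.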

6. Provider filtering appears insufficient for the full path

Direct call:

resolvePluginDiscoveryProvidersRuntime({ onlyPluginIds: ['deepseek'] })

behaved mostly as expected:

  • entries-only returned only deepseek in ~81ms
  • normal only-plugin run returned only deepseek in ~7.6s

But when running through ensureOpenClawModelsJson() with provider filtering, profiling still showed broad provider plugin loading, including:

  • anthropic
  • byteplus
  • deepseek
  • moonshot
  • openai
  • tencent
  • volcengine
  • xai

Even a minimized in-memory cfg.models.providers = { deepseek: ... } still showed broad provider plugin load/profile activity.

A later source-path refinement makes provider-specific policy helpers less likely to be the cold-load trigger: normalizeProviderConfigPolicy(), applyProviderNativeStreamingUsagePolicy(), and resolveProviderConfigApiKeyPolicy() call their plugin helpers with allowRuntimePluginLoad: false.

So the stronger suspect is the runtime discovery stage before/around normalization:

resolveProvidersForModelsJsonWithDeps()
  -> resolveImplicitProviders()
  -> resolveRuntimePluginDiscoveryProviders()
  -> dynamic import of plugins/provider-discovery.runtime.js
  -> resolvePluginDiscoveryProvidersRuntime()
  -> resolvePluginProviders()

In particular, resolvePluginDiscoveryProvidersRuntime() has a selective full-plugin fallback when onlyPluginIds is undefined. If the provider filter is not passed through correctly from config/hot-reload context, or if source snapshot projection reintroduces providers, this can cause broad provider loading.

7. Warm cache is much faster, cold path is the problem

Within one process:

  • first full provider discovery call: ~17.5s
  • second call: ~1.3s
  • third call: ~0.7s

So caching helps, but config/model hot reloads or cache invalidation can re-enter the expensive cold path.

Current diagnosis

The Gateway can become temporarily unresponsive because model catalog regeneration / provider discovery performs expensive cold plugin/runtime work in or adjacent to latency-sensitive Gateway paths.

This can delay timers, WebSocket handshakes, CLI/RPC responses, Telegram polling, and possibly status/log commands.

The issue is likely not one single slow transcript read, but a combination of:

  1. expensive cold provider/plugin runtime loading,
  2. insufficient provider filtering in the full model-catalog regeneration path,
  3. potentially unbounded or poorly bounded live provider discovery,
  4. cache invalidation after model/config hot reloads,
  5. work running on the Gateway event loop instead of being deferred/offloaded.

Proposed fixes / mitigations

Short-term mitigations

  1. Add a bounded default timeout for live provider discovery in Gateway contexts.

    • Example: default OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS to a small value when running inside gateway.
    • Avoid unbounded provider catalog resolution during hot reload.
  2. Avoid running full ensureOpenClawModelsJson() synchronously after config/model hot reload.

    • Defer regeneration.
    • Serve stale cached catalog while refresh happens in background.
    • Emit a warning if refresh fails or times out.
  3. Ensure provider discovery filters are passed through the full path.

    • In particular, check what onlyPluginIds is passed into resolveRuntimePluginDiscoveryProviders() during ensureOpenClawModelsJson() after config/model hot reloads.
    • Avoid broad resolvePluginProviders() when a narrowed provider set is known.
  4. Add method-level timing logs around:

    • applyHotReload()
    • resetModelCatalogCache()
    • ensureOpenClawModelsJson()
    • planOpenClawModelsJson()
    • resolveProvidersForModelsJsonWithDeps()
    • resolveImplicitProviders()
    • resolveRuntimePluginDiscoveryProviders()
    • resolvePluginDiscoveryProvidersRuntime()
    • resolvePluginProviders()
    • readSessionMessages()
  5. Add runtime event-loop-delay instrumentation in the Gateway.

    • For example, log when event-loop delay exceeds 1s / 5s / 10s, including the current high-level operation if known.

Longer-term fixes

  1. Move expensive model catalog/provider discovery work out of the Gateway request/event-loop path.

    • Use a worker thread or child process.
    • Keep Gateway responsive while catalog refresh happens.
  2. Make provider discovery fully cache-aware and cancellation-aware.

    • Timeouts should abort work cleanly, not merely stop waiting while leaving expensive tasks/resources alive.
  3. Separate static provider config normalization from live provider discovery.

    • Normalization should not require broad plugin loading if static config is already known.
  4. Improve CLI command decomposition.

    • Commands like openclaw status / openclaw logs should avoid unnecessary status/bootstrap/provider/model work when the requested RPC itself is cheap.

Reproduction / diagnostic hints

Useful probes:

  • Compare logs.tail RPC time vs full openclaw logs command wall time.
  • Trigger model/default alias config hot reload, then watch for late WS handshake timers.
  • Run ensureOpenClawModelsJson() in a fresh Node process with plugin profiling enabled.
  • Test with and without provider discovery timeout:
    • OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS=3000
    • OPENCLAW_LIVE_GATEWAY_PROVIDERS=codex
    • OPENCLAW_PLUGIN_LOAD_PROFILE=1

Key expected signal:

  • A 10s WebSocket preauth timer firing after much more than 10s is the clearest symptom of event-loop delay.
