Gateway intermittently stalls: WebSocket preauth handshakes time out late during model catalog/provider discovery #74135

@DerFlash

Description

Summary

The OpenClaw Gateway intermittently becomes unresponsive for tens of seconds. During these windows, WebSocket preauth handshakes time out, but the timeout callbacks fire much later than the configured 10s timeout. Observed values range from ~18s to ~72s.

This strongly suggests the Gateway Node.js event loop is being blocked or starved by synchronous / CPU-heavy work inside the gateway process, rather than a pure networking or client-side timeout issue.

The strongest current suspect is expensive model catalog regeneration / provider discovery after model/config hot reloads, especially around:

  • ensureOpenClawModelsJson()
  • planOpenClawModelsJson()
  • resolveProvidersForModelsJsonWithDeps()
  • resolveImplicitProviders()
  • resolveRuntimePluginDiscoveryProviders()
  • dynamic import of plugins/provider-discovery.runtime.js
  • resolvePluginDiscoveryProvidersRuntime() / resolvePluginProviders()

This issue captures initial findings; I can add deeper traces if needed.

Environment

  • OS: Ubuntu 22.04.5 LTS, arm64
  • Node.js: v24.14.1
  • Gateway bound on LAN, local port 18789
  • Gateway process remains alive during incidents
  • No obvious FD leak observed during checks
  • Host not thermally throttled during later checks:
    • 4 cores
    • temperature around ~41°C
    • vcgencmd get_throttled = 0x0

Symptoms

Observed in Gateway logs:

  • WebSocket connections close before full connect/preauth.
  • The dominant failure class is cause=handshake-timeout.
  • Configured preauth handshake timeout is 10s.
  • Actual logged handshakeMs / durationMs values were much larger:

Examples:

  • handshakeMs=47133
  • handshakeMs=54995
  • handshakeMs=49755
  • handshakeMs=71389
  • handshakeMs=18201
  • handshakeMs=34302

Because the timer callback itself fires late, this looks like event-loop starvation/blocking.

Additional symptoms seen during affected windows:

  • Some Gateway RPCs occasionally slow:
    • config.get ~2.1–2.4s
    • channels.status ~2.0s
    • config.schema.lookup ~3.2s
    • agent.wait reached expected long wait timeout
  • Telegram polling stalls were also seen near some incident windows.
  • openclaw status --json --timeout 1000 had ~14s wall time despite reported gateway connect latency around ~160ms.
  • openclaw logs --limit 5 --timeout 5000 had ~12s wall time, while the underlying logs.tail RPC later measured fast after warmup.

Findings

1. Not likely to be a simple network / socket issue

The Gateway was alive and listening on 0.0.0.0:18789.

HTTP/dashboard probes to 127.0.0.1:18789 were healthy when idle:

  • first probe around ~285ms
  • later probes around ~30ms

No obvious FD leak was observed:

  • FD count around mid/high 30s
  • threads around 11

2. Preauth timer firing late is the key evidence

Code inspection found:

  • DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS = 10000
  • server uses getPreauthHandshakeTimeoutMsFromEnv() for the preauth timer

But logs showed timeout callbacks firing after 18–72s. That strongly indicates the event loop was unable to run the timer callback on time.
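The mechanism can be demonstrated in isolation. This is a minimal sketch (not OpenClaw code): a 100ms timer is scheduled, then ~500ms of synchronous work (standing in for cold provider discovery) blocks the loop, so the timer fires ~400ms late — the same mechanism that lets a 10s preauth timer fire at 18–72s.

```javascript
// A 100ms timer cannot fire while synchronous work holds the event loop.
const scheduledAt = Date.now();
const firedLate = new Promise((resolve) => {
  setTimeout(() => resolve(Date.now() - scheduledAt - 100), 100);
});

// Synchronous CPU-bound work: no timers, I/O, or handshakes run meanwhile.
const blockUntil = Date.now() + 500;
while (Date.now() < blockUntil) { /* busy-wait */ }

firedLate.then((lateMs) => console.log(`timer fired ~${lateMs}ms late`));
```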

3. Session/transcript sync reads are a smell, but probably not the sole root cause

Several gateway-adjacent paths use synchronous file reads/parsing:

  • session store reads via readSessionStoreReadOnly
  • transcript reads via readSessionMessages()
  • some status/history/session preview paths

This is concerning, especially for large transcripts/checkpoints. However, microbenchmarks on the largest local artifacts showed individual parse costs mostly in the ~100–500ms range, not enough alone to explain 70s stalls.

So this remains a contributor/smell, but not the strongest single root cause.

4. Late handshake windows correlate with state churn

The worst late-handshake windows clustered around:

  • config/model/default alias hot reloads
  • provider/model catalog related changes
  • pairing/device-state churn
  • stuck session / embedded-run lifecycle events

This points toward bursts of gateway-side state recomputation or plugin/runtime work.

5. Stronger suspect: model catalog / provider discovery cold path

Local probes outside the gateway showed ensureOpenClawModelsJson() can be extremely expensive:

  • importing model-catalog-*.js in a fresh Node process took ~6.5s

  • loadModelCatalog() / ensureOpenClawModelsJson() without an outer provider-discovery timeout did not complete within test timeouts in some probes

  • with:

    • OPENCLAW_LIVE_GATEWAY=1
    • OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS=3000
    • OPENCLAW_LIVE_GATEWAY_PROVIDERS=codex

    it completed in ~20.7s and logged that Codex provider catalog discovery timed out after 3000ms.

Filtered single-provider probes still showed high wall times:

  • anthropic-vertex: ~13.9s
  • deepseek: ~38.6s
  • codex with 3s provider timeout: ~20.7s

This suggests the problem is not only one slow live provider catalog call. There appears to be significant plugin/runtime/provider setup overhead.
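A plausible (hedged) sketch of what a per-provider timeout like OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS does — I have not confirmed the real implementation — is racing discovery against a timer:

```javascript
// Race a discovery promise against a timeout; resolve with a fallback if
// the timeout wins. Names and shape are illustrative, not OpenClaw code.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for a slow provider catalog call; the timer is unref'd so the
// demo process can exit once the race settles.
const slowDiscovery = new Promise((resolve) => {
  setTimeout(() => resolve(['deepseek']), 5_000).unref();
});

withTimeout(slowDiscovery, 50, []).then((providers) => {
  console.log(`discovery returned ${providers.length} provider(s)`);
});
```

Note this pattern only stops *waiting*: the underlying work keeps running on the loop, which would be consistent with the codex probe still taking ~20.7s of wall time despite its 3s provider timeout.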

6. Provider filtering appears insufficient for the full path

Direct call:

resolvePluginDiscoveryProvidersRuntime({ onlyPluginIds: ['deepseek'] })

behaved mostly as expected:

  • entries-only returned only deepseek in ~81ms
  • normal only-plugin run returned only deepseek in ~7.6s

But when running through ensureOpenClawModelsJson() with provider filtering, profiling still showed broad provider plugin loading, including:

  • anthropic
  • byteplus
  • deepseek
  • moonshot
  • openai
  • tencent
  • volcengine
  • xai

Even a minimized in-memory cfg.models.providers = { deepseek: ... } still showed broad provider plugin load/profile activity.

A later source-path refinement makes provider-specific policy helpers less likely to be the cold-load trigger: normalizeProviderConfigPolicy(), applyProviderNativeStreamingUsagePolicy(), and resolveProviderConfigApiKeyPolicy() call their plugin helpers with allowRuntimePluginLoad: false.

So the stronger suspect is the runtime discovery stage before/around normalization:

resolveProvidersForModelsJsonWithDeps()
  -> resolveImplicitProviders()
  -> resolveRuntimePluginDiscoveryProviders()
  -> dynamic import of plugins/provider-discovery.runtime.js
  -> resolvePluginDiscoveryProvidersRuntime()
  -> resolvePluginProviders()

In particular, resolvePluginDiscoveryProvidersRuntime() has a selective full-plugin fallback when onlyPluginIds is undefined. If the provider filter is not passed through correctly from config/hot-reload context, or if source snapshot projection reintroduces providers, this can cause broad provider loading.

7. Warm cache is much faster, cold path is the problem

Within one process:

  • first full provider discovery call: ~17.5s
  • second call: ~1.3s
  • third call: ~0.7s

So caching helps, but config/model hot reloads or cache invalidation can re-enter the expensive cold path.

Current diagnosis

The Gateway can become temporarily unresponsive because model catalog regeneration / provider discovery performs expensive cold plugin/runtime work in or adjacent to latency-sensitive Gateway paths.

This can delay timers, WebSocket handshakes, CLI/RPC responses, Telegram polling, and possibly status/log commands.

The issue is likely not one single slow transcript read, but a combination of:

  1. expensive cold provider/plugin runtime loading,
  2. insufficient provider filtering in the full model-catalog regeneration path,
  3. potentially unbounded or poorly bounded live provider discovery,
  4. cache invalidation after model/config hot reloads,
  5. work running on the Gateway event loop instead of being deferred/offloaded.

Proposed fixes / mitigations

Short-term mitigations

  1. Add a bounded default timeout for live provider discovery in Gateway contexts.

    • Example: default OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS to a small value when running inside gateway.
    • Avoid unbounded provider catalog resolution during hot reload.
  2. Avoid running full ensureOpenClawModelsJson() synchronously after config/model hot reload.

    • Defer regeneration.
    • Serve stale cached catalog while refresh happens in background.
    • Emit a warning if refresh fails or times out.
  3. Ensure provider discovery filters are passed through the full path.

    • In particular, check what onlyPluginIds is passed into resolveRuntimePluginDiscoveryProviders() during ensureOpenClawModelsJson() after config/model hot reloads.
    • Avoid broad resolvePluginProviders() when a narrowed provider set is known.
  4. Add method-level timing logs around:

    • applyHotReload()
    • resetModelCatalogCache()
    • ensureOpenClawModelsJson()
    • planOpenClawModelsJson()
    • resolveProvidersForModelsJsonWithDeps()
    • resolveImplicitProviders()
    • resolveRuntimePluginDiscoveryProviders()
    • resolvePluginDiscoveryProvidersRuntime()
    • resolvePluginProviders()
    • readSessionMessages()
  5. Add runtime event-loop-delay instrumentation in the Gateway.

    • For example, log when event-loop delay exceeds 1s / 5s / 10s, including the current high-level operation if known.

Longer-term fixes

  1. Move expensive model catalog/provider discovery work out of the Gateway request/event-loop path.

    • Use a worker thread or child process.
    • Keep Gateway responsive while catalog refresh happens.
  2. Make provider discovery fully cache-aware and cancellation-aware.

    • Timeouts should abort work cleanly, not merely stop waiting while leaving expensive tasks/resources alive.
  3. Separate static provider config normalization from live provider discovery.

    • Normalization should not require broad plugin loading if static config is already known.
  4. Improve CLI command decomposition.

    • Commands like openclaw status / openclaw logs should avoid unnecessary status/bootstrap/provider/model work when the requested RPC itself is cheap.

Reproduction / diagnostic hints

Useful probes:

  • Compare logs.tail RPC time vs full openclaw logs command wall time.
  • Trigger model/default alias config hot reload, then watch for late WS handshake timers.
  • Run ensureOpenClawModelsJson() in a fresh Node process with plugin profiling enabled.
  • Test with and without provider discovery timeout:
    • OPENCLAW_LIVE_PROVIDER_DISCOVERY_TIMEOUT_MS=3000
    • OPENCLAW_LIVE_GATEWAY_PROVIDERS=codex
    • OPENCLAW_PLUGIN_LOAD_PROFILE=1

Key expected signal:

  • A 10s WebSocket preauth timer firing after much more than 10s is the clearest symptom of event-loop delay.
