CLI gateway handshake timeout too short for cold-start module compilation #51469

@GothicFox

Summary

CLI commands that connect to the gateway (e.g. openclaw cron list) fail with the error "gateway closed (1000 normal closure): no close reason" on systems where Node.js ESM module compilation takes longer than the gateway's 3-second WebSocket handshake timeout.

Environment

  • OpenClaw: 2026.3.13 (61d171a)
  • Node.js: v22.22.1
  • OS: Ubuntu Linux (systemd service)
  • Gateway mode: local (loopback)

Steps to Reproduce

  1. Install openclaw, configure gateway in local mode
  2. Run openclaw cron list (or any CLI command that connects to the gateway)
  3. Observe the error:
gateway connect failed: Error: gateway closed (1000):
Error: gateway closed (1000 normal closure): no close reason
Gateway target: ws://127.0.0.1:18789

Note: openclaw gateway status (which uses an RPC probe rather than the WebSocket handshake) works fine. Webchat also works fine.

Root Cause Analysis

Through debugging, the following timeline was identified:

  1. T+0ms: CLI creates WebSocket connection to gateway
  2. T+~1ms: Gateway accepts TCP connection, sends connect.challenge event
  3. T+~8ms: Node.js setImmediate fires — event loop appears free
  4. T+3000ms: Gateway's handshake timer expires (3s default), closes WebSocket with code 1000
  5. T+~12000ms: CLI's event loop finally processes WebSocket open and message events
  6. CLI calls sendConnect() → request("connect") → finds WebSocket already in CLOSING state → error
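The gateway side of this timeline can be sketched as a timer that closes the socket unless the handshake completes first. This is a minimal illustration, not the gateway's actual code: `SocketLike` and `armHandshakeTimeout` are hypothetical names, and only the close code 1000 and the timeout behavior come from the timeline above.

```typescript
// Hypothetical minimal socket shape; the real gateway's socket type is not shown in this issue.
interface SocketLike {
  close(code: number, reason?: string): void;
}

// Arm a handshake timer: if the returned cancel function is not called
// within timeoutMs, the socket is closed with code 1000 and no reason.
function armHandshakeTimeout(socket: SocketLike, timeoutMs: number): () => void {
  const timer = setTimeout(() => socket.close(1000), timeoutMs);
  // The caller invokes this once the client's "connect" request arrives.
  return () => clearTimeout(timer);
}
```

A client whose event loop is stalled for longer than `timeoutMs` never gets the chance to send its connect request, so the cancel function is never called and the close fires, which matches steps 4 to 6.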

The ~12-second gap between WebSocket creation and event processing is caused by Node.js ESM module compilation blocking the event loop. The CLI's large bundled dist files (with 41 dynamic import() calls) take significant time to compile on cold start.

The gateway's DEFAULT_HANDSHAKE_TIMEOUT_MS = 3000 (in src/gateway/server-constants.ts) is insufficient for CLI clients that experience this cold-start delay.

Key evidence:

  • Standalone WebSocket test connects in ~120ms (no module compilation overhead)
  • Inside the CLI process, the same connection takes ~12 seconds
  • setInterval(100ms) set right after client.start() doesn't fire its first tick until +12 seconds later
  • Webchat is unaffected because browser JS is pre-compiled/bundled
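The timer observation above can be reproduced in isolation. The sketch below is an assumption-laden stand-in: a busy-wait loop plays the role of module compilation, and `blockEventLoop` and `measureTimerDelay` are hypothetical helper names, not part of OpenClaw.

```typescript
// Busy-wait synchronously; stands in for any long synchronous work
// (here, an assumed substitute for module compilation).
function blockEventLoop(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: no timers or I/O callbacks can run while this loop executes
  }
}

// Schedule a timer, then block the event loop. Resolves with the number of
// milliseconds that actually elapsed before the timer callback ran.
function measureTimerDelay(timerMs: number, blockMs: number): Promise<number> {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), timerMs);
    blockEventLoop(blockMs);
  });
}
```

Calling `measureTimerDelay(100, 2000)` should resolve with a value of at least 2000 rather than roughly 100, the same pattern as the setInterval(100ms) tick that did not fire until about 12 seconds later.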

Suggested Fix

  1. Increase the default handshake timeout from 3s to at least 15s (CLI cold start can take 12+ seconds)
  2. Make the timeout configurable via openclaw.json:
{
  "gateway": {
    "handshakeTimeoutMs": 15000
  }
}
  3. Allow env var override (without requiring VITEST):
const getHandshakeTimeoutMs = () => {
  // Env var override (checked first so it takes precedence over the config file)
  const envValue = Number(process.env.OPENCLAW_HANDSHAKE_TIMEOUT_MS);
  if (Number.isFinite(envValue) && envValue > 0) return envValue;
  // Config file (openclaw.json)
  const configValue = config.gateway?.handshakeTimeoutMs;
  if (typeof configValue === 'number' && configValue > 0) return configValue;
  // Default
  return DEFAULT_HANDSHAKE_TIMEOUT_MS; // 15000
};
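To make the lookup order easy to test, the same idea can be written as a pure function that receives the environment and config as arguments. This is only a sketch under assumptions: it treats the env var as winning over the config file (one reading of "override"), and `resolveHandshakeTimeoutMs`, `GatewayConfig`, and `FALLBACK_HANDSHAKE_TIMEOUT_MS` are hypothetical names rather than existing OpenClaw identifiers.

```typescript
// Hypothetical shape of the relevant part of openclaw.json.
interface GatewayConfig {
  gateway?: { handshakeTimeoutMs?: unknown };
}

// Proposed default from this issue (the current default is 3000).
const FALLBACK_HANDSHAKE_TIMEOUT_MS = 15000;

function isPositiveFinite(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value) && value > 0;
}

// Assumed precedence: env var, then config file, then default.
function resolveHandshakeTimeoutMs(
  env: Record<string, string | undefined>,
  config: GatewayConfig,
): number {
  const raw = env["OPENCLAW_HANDSHAKE_TIMEOUT_MS"];
  // Treat unset or blank as "not provided" instead of letting Number("") become 0.
  const fromEnv = raw === undefined || raw.trim() === "" ? NaN : Number(raw);
  if (isPositiveFinite(fromEnv)) return fromEnv;
  const fromConfig = config.gateway?.handshakeTimeoutMs;
  if (isPositiveFinite(fromConfig)) return fromConfig;
  return FALLBACK_HANDSHAKE_TIMEOUT_MS;
}

// Example calls:
// resolveHandshakeTimeoutMs({}, {})                                            // → 15000
// resolveHandshakeTimeoutMs({}, { gateway: { handshakeTimeoutMs: 5000 } })     // → 5000
// resolveHandshakeTimeoutMs({ OPENCLAW_HANDSHAKE_TIMEOUT_MS: "20000" },
//                           { gateway: { handshakeTimeoutMs: 5000 } })         // → 20000
```

Invalid values (non-numeric strings, zero, negatives) fall through to the next source instead of producing a zero or NaN timeout.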

Current Workaround

Patch the compiled dist files (gateway-cli-*.js) to change DEFAULT_HANDSHAKE_TIMEOUT_MS from 3e3 to 15e3, then restart the gateway. The patch is lost on every openclaw update.
