Summary
When the gateway event loop is busy (processing agent turns, compaction, or concurrent sessions), the 3-second WebSocket handshake timeout (DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3 in gateway-cli-*.js) fires before the connect challenge completes. The gateway closes the connection with code 1000 (normal closure), and the CLI reports:
gateway connect failed: Error: gateway closed (1000):
This affects all CLI-to-gateway WS calls, including read-only operations like openclaw cron list.
Environment
- OpenClaw: 2026.3.13 (61d171a)
- Host: macOS 26.3.1, Apple Silicon Mac mini, Node v24.14.0
- Gateway config: maxConcurrent: 4, loopback bind
Steps to Reproduce
- Run a gateway with multiple concurrent agent sessions (3-4 active)
- From a cron job or external script, run openclaw cron list --json while the gateway is processing agent turns
- The CLI connects via WS but the gateway's handshake challenge isn't answered within 3 seconds
- Gateway closes the WS with code 1000, CLI reports failure
This is intermittent — depends on event loop pressure at the exact moment of connection.
Root Cause
DEFAULT_HANDSHAKE_TIMEOUT_MS is hardcoded to 3e3 (3 seconds) in the gateway:
// gateway-cli-*.js line ~7586
const DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3;
const getHandshakeTimeoutMs = () => {
if (process.env.VITEST && process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) {
const parsed = Number(process.env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS);
if (Number.isFinite(parsed) && parsed > 0) return parsed;
}
return DEFAULT_HANDSHAKE_TIMEOUT_MS;
};
The env var override (OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) is gated behind process.env.VITEST, making it test-only.
Why This Surfaced in v2026.3.13
The v2026.3.13 fix "Gateway/client requests: reject unanswered gateway RPC calls after a bounded timeout" introduced active rejection of stalled connections. In v2026.3.12, busy handshakes would hang indefinitely (the CLI's own subprocess timeout would handle it). Now the gateway actively closes them, surfacing the 3s limit as a user-visible failure.
Suggested Fix
- Increase default from 3s to ~10s — 3s is too tight for a local loopback connection when the event loop is under load
- Make it user-configurable via gateway.handshakeTimeoutMs in openclaw.json (or similar config key)
- Remove the VITEST gate on OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS so users can override via env var as a stopgap
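A minimal sketch of how the three suggestions could combine into one resolver. This is not the current OpenClaw implementation: the function names resolveHandshakeTimeoutMs and parseTimeoutMs, the precedence order (env var, then config, then default), and the shape of the config object are assumptions made for illustration. Only the constant name, the env var name, and the gateway.handshakeTimeoutMs key come from the text above.

```javascript
// Hypothetical sketch: proposed default raised from 3e3 to 10e3.
const DEFAULT_HANDSHAKE_TIMEOUT_MS = 10e3;

// Returns a positive finite number of milliseconds, or undefined when the
// value is missing or not usable as a timeout.
const parseTimeoutMs = (value) => {
  if (value === undefined || value === null || value === "") return undefined;
  const parsed = Number(value);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : undefined;
};

// Assumed precedence: env var (stopgap override, no VITEST gate),
// then gateway.handshakeTimeoutMs from the parsed openclaw.json, then default.
const resolveHandshakeTimeoutMs = (config = {}, env = process.env) =>
  parseTimeoutMs(env.OPENCLAW_TEST_HANDSHAKE_TIMEOUT_MS) ??
  parseTimeoutMs(config.gateway?.handshakeTimeoutMs) ??
  DEFAULT_HANDSHAKE_TIMEOUT_MS;
```

Invalid values (non-numeric, zero, negative) fall through to the next source rather than throwing, which mirrors the Number.isFinite(parsed) && parsed > 0 check in the existing code.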
Workaround
Monkey-patch the installed package:
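# -i.bak works with both BSD sed (macOS) and GNU sed; a .bak copy is left next to each patched file
sed -i.bak 's/const DEFAULT_HANDSHAKE_TIMEOUT_MS = 3e3;/const DEFAULT_HANDSHAKE_TIMEOUT_MS = 10e3;/' \
"$(dirname "$(which openclaw)")"/../lib/node_modules/openclaw/dist/gateway-cli-*.js
# Then restart gateway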
Gateway Log Evidence
{
"cause": "handshake-timeout",
"handshake": "failed",
"durationMs": 3908,
"handshakeMs": 3002,
"host": "127.0.0.1:18789",
"code": 1000,
"reason": "n/a"
}
Observed ~34 failures over 18 hours with the same pattern — always handshakeMs: 3002.
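To reproduce a count like the one above from a log file, something along these lines may help. It is a sketch under the assumption that the gateway log is JSON Lines (one object per line) carrying the cause and handshakeMs fields shown in the sample entry; summarizeHandshakeTimeouts is a hypothetical helper, not part of OpenClaw.

```javascript
// Hypothetical helper: counts handshake-timeout entries in JSON Lines text
// and reports the range of handshakeMs values seen.
const summarizeHandshakeTimeouts = (logText) => {
  const durations = [];
  for (const line of logText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("{")) continue; // skip blanks and non-JSON lines
    let entry;
    try {
      entry = JSON.parse(trimmed);
    } catch {
      continue; // ignore truncated or malformed lines
    }
    if (entry.cause === "handshake-timeout" && Number.isFinite(entry.handshakeMs)) {
      durations.push(entry.handshakeMs);
    }
  }
  return {
    count: durations.length,
    minHandshakeMs: durations.length ? Math.min(...durations) : null,
    maxHandshakeMs: durations.length ? Math.max(...durations) : null,
  };
};
```

If minHandshakeMs and maxHandshakeMs both sit just above 3000, that is consistent with the fixed 3-second timer firing rather than a variable network delay.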