fix: clear WS handshake timer early, increase timeouts #987

Open
BingqingLyu wants to merge 1 commit into main from fork-pr-49751-fix-ws-handshake-timeout-loopback-token-auth

BingqingLyu commented Apr 27, 2026

Summary

  • Move clearHandshakeTimer() earlier in the gateway connect handler — right after validating the connect request, before any async auth work (device verification, token resolution, etc.)
  • Raise default timeouts: server handshake 3s->10s, client challenge 2s->10s

Problem

On resource-constrained hosts (small VPS, busy Node.js event loop), the gateway's async auth flow (resolveConnectAuthState, device signature verification, token grant) can take longer than the 3-second DEFAULT_HANDSHAKE_TIMEOUT_MS. Because clearHandshakeTimer() is called only after the entire auth flow completes, the timer fires mid-auth and closes a legitimate in-progress connection with code 1000.

The client sends a valid connect request within ~50ms of WS open, and the gateway receives it and starts processing auth. The handshake timer then fires 3 seconds after WS open and closes the connection before auth finishes.

Gateway logs show:

[ws] handshake timeout conn=... remote=127.0.0.1
[ws] closed before connect conn=... code=1000 reason=n/a

Client sees:

Error: gateway closed (1000 normal closure): no close reason

Root cause

The handshake timer starts on WS open and races against the full async auth resolution. When auth takes >3s (due to event loop pressure, slow I/O, or heavy GC), the timer wins.

Timeline (broken):
  0ms    WS open -> handshake timer starts (3s)
  ~15ms  gateway sends connect.challenge
  ~50ms  client sends connect request -> gateway starts async auth
  3000ms handshake timer fires -> gateway sees !client -> close()
  ???ms  auth would have completed -> too late, connection dead
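
The race in this timeline can be modeled with plain arithmetic. The sketch below is illustrative only: `simulateHandshake`, `HandshakeScenario`, and `ClearPoint` are hypothetical names, not part of the gateway source, and ties are resolved in favor of the timer as an arbitrary choice.

```typescript
// Hypothetical model of the handshake race; none of these names exist in the gateway.
type ClearPoint = "after-auth" | "after-connect-validated";
type Outcome = "session-established" | "closed-by-handshake-timer";

interface HandshakeScenario {
  handshakeTimeoutMs: number; // timer starts at WS open (t = 0)
  connectReceivedAtMs: number; // when the valid connect request arrives
  authDurationMs: number; // how long async auth takes once it starts
  clearPoint: ClearPoint; // where clearHandshakeTimer() is called
}

function simulateHandshake(s: HandshakeScenario): Outcome {
  const authDoneAtMs = s.connectReceivedAtMs + s.authDurationMs;
  const timerClearedAtMs =
    s.clearPoint === "after-auth" ? authDoneAtMs : s.connectReceivedAtMs;
  // The timer wins if it fires at or before the moment it would be cleared.
  return s.handshakeTimeoutMs <= timerClearedAtMs
    ? "closed-by-handshake-timer"
    : "session-established";
}
```

With a 3000ms timeout, a connect request at 50ms, and 4000ms of auth, clearing after auth yields `closed-by-handshake-timer`, while clearing once the connect request validates yields `session-established`.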

Fix

Separate "did the client send a valid connect request?" from "is the auth valid?". Clear the timer as soon as the connect request parses and validates, before entering async auth:

Timeline (fixed):
  0ms    WS open -> handshake timer starts (10s)
  ~15ms  gateway sends connect.challenge
  ~50ms  client sends connect request -> clearHandshakeTimer() -> async auth begins
  ???ms  auth completes -> session established

If auth fails, the connection is closed by the auth handler itself (with a proper error code), not by the handshake timer.
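
A minimal sketch of that ordering follows, assuming a JSON connect frame and an injected auth function. Only `clearHandshakeTimer` and the `connect` method come from this PR; `handleFirstFrame`, `parseConnectFrame`, `GatewayConnection`, `Authenticate`, and the 1008 close codes are hypothetical placeholders, not the gateway's actual handler.

```typescript
// Sketch only. clearHandshakeTimer and the "connect" method come from the PR
// description; every other name, and both close codes, are hypothetical.
interface ConnectFrame {
  method: "connect";
  params?: { token?: string };
}

interface GatewayConnection {
  clearHandshakeTimer(): void;
  close(code: number, reason: string): void;
}

type Authenticate = (token: string | undefined) => Promise<boolean>;

// Returns the parsed frame, or null when it is malformed or not a connect request.
function parseConnectFrame(raw: string): ConnectFrame | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (
      typeof value === "object" &&
      value !== null &&
      (value as { method?: unknown }).method === "connect"
    ) {
      return value as ConnectFrame;
    }
  } catch {
    // Malformed JSON falls through to the null return below.
  }
  return null;
}

async function handleFirstFrame(
  conn: GatewayConnection,
  raw: string,
  authenticate: Authenticate,
): Promise<"established" | "rejected" | "invalid"> {
  const frame = parseConnectFrame(raw);
  if (frame === null) {
    conn.close(1008, "invalid handshake"); // placeholder close code
    return "invalid";
  }

  // A valid connect request arrived, so the handshake deadline is met.
  // Clearing here, before the first await, keeps slow auth out of the race.
  conn.clearHandshakeTimer();

  const authorized = await authenticate(frame.params?.token);
  if (!authorized) {
    conn.close(1008, "unauthorized"); // placeholder close code
    return "rejected";
  }
  return "established";
}
```

Because an async function runs synchronously up to its first `await`, the timer is already cleared by the time the auth promise is created, however long that promise takes to settle.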

Files changed

| File | Change |
| --- | --- |
| src/gateway/server/ws-connection/message-handler.ts | Move clearHandshakeTimer() from post-auth to pre-auth |
| src/gateway/server-constants.ts | DEFAULT_HANDSHAKE_TIMEOUT_MS: 3s -> 10s |
| src/gateway/client.ts | Client challenge timeout default: 2s -> 10s |
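
The commit message also notes a 30s maximum for the client challenge timeout. One way such a default and cap could be combined is sketched below. This is an assumption about how the bound is applied, not the code in src/gateway/client.ts: the constant names and `resolveChallengeTimeoutMs` are hypothetical.

```typescript
// Hypothetical names; only the 10s default and 30s maximum come from the PR.
const DEFAULT_CHALLENGE_TIMEOUT_MS = 10_000;
const MAX_CHALLENGE_TIMEOUT_MS = 30_000;

// Falls back to the default for missing or unusable values, and caps large ones.
function resolveChallengeTimeoutMs(requestedMs?: number): number {
  if (
    requestedMs === undefined ||
    !Number.isFinite(requestedMs) ||
    requestedMs <= 0
  ) {
    return DEFAULT_CHALLENGE_TIMEOUT_MS;
  }
  return Math.min(requestedMs, MAX_CHALLENGE_TIMEOUT_MS);
}
```

Under these assumptions, an unset value resolves to 10000, a request for 5000 is kept as-is, and a request for 60000 is capped at 30000.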

Test plan

  • Existing gateway WS tests pass (timer moved, not removed — invalid/missing connect still times out)
  • openclaw gateway health succeeds on a resource-constrained host (2GB VPS, loopback + token auth)
  • openclaw cron list succeeds (requires operator.read scope via device identity)
  • Invalid handshake (malformed frame, wrong method) still closes promptly
  • Unauthenticated connect still rejected by auth handler (not the timer)
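
The "timer moved, not removed" check can be exercised without real waiting by driving a manual clock. The sketch below is a hypothetical test helper, not the project's test suite: `FakeClock` and `armHandshakeTimer` are invented names, and the 10000ms value mirrors the new server default.

```typescript
// Hypothetical test helper: a manual clock so timer behavior can be checked instantly.
interface PendingTimer {
  id: number;
  dueAtMs: number;
  fn: () => void;
}

class FakeClock {
  private nowMs = 0;
  private nextId = 1;
  private pending: PendingTimer[] = [];

  setTimeout(fn: () => void, delayMs: number): number {
    const id = this.nextId++;
    this.pending.push({ id, dueAtMs: this.nowMs + delayMs, fn });
    return id;
  }

  clearTimeout(id: number): void {
    this.pending = this.pending.filter((t) => t.id !== id);
  }

  // Moves time forward and runs every timer that has become due, earliest first.
  advance(ms: number): void {
    this.nowMs += ms;
    const due = this.pending
      .filter((t) => t.dueAtMs <= this.nowMs)
      .sort((a, b) => a.dueAtMs - b.dueAtMs);
    this.pending = this.pending.filter((t) => t.dueAtMs > this.nowMs);
    for (const t of due) t.fn();
  }
}

// Arms a handshake timer and returns the function that cancels it.
function armHandshakeTimer(
  clock: FakeClock,
  timeoutMs: number,
  onTimeout: () => void,
): () => void {
  const id = clock.setTimeout(onTimeout, timeoutMs);
  return () => clock.clearTimeout(id);
}

// Usage: a client that never sends connect is still closed by the timer.
const demoClock = new FakeClock();
const demoLog: string[] = [];
armHandshakeTimer(demoClock, 10_000, () => demoLog.push("handshake timeout"));
demoClock.advance(10_000);
console.log(demoLog.join(",")); // prints "handshake timeout"
```

Calling the returned cancel function at the point where the connect request validates, then advancing the clock past the timeout, leaves the log empty, which matches the fixed timeline.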

Fixes openclaw#46650
Fixes openclaw#48167

The gateway handshake timer fires before connect auth completes on
resource-constrained hosts (small VPS, busy event loop), because
clearHandshakeTimer() is called only after the full async auth flow
finishes. On systems where resolveConnectAuthState() or device
signature verification takes >3s, the timer kills a legitimate
in-progress connection with code 1000 "normal closure".

Move clearHandshakeTimer() to right after the connect request is
validated (parsed, method == connect, params OK) — before any async
auth work begins. This separates "did the client send a valid connect
request?" from "is the auth valid?", preventing the timer from racing
against auth resolution.

Also raise the default timeouts as a safety margin:
- Server handshake timeout: 3s → 10s
- Client challenge timeout: 2s → 10s (max 30s)

Fixes openclaw#46650
Fixes openclaw#48167

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>