fix: clear WS handshake timer early, increase timeouts #987
Open
BingqingLyu wants to merge 1 commit into main from
The gateway handshake timer fires before connect auth completes on resource-constrained hosts (small VPS, busy event loop), because clearHandshakeTimer() is called only after the full async auth flow finishes. On systems where resolveConnectAuthState() or device signature verification takes >3s, the timer kills a legitimate in-progress connection with code 1000 "normal closure".
Move clearHandshakeTimer() to right after the connect request is validated (parsed, method == connect, params OK), before any async auth work begins. This separates "did the client send a valid connect request?" from "is the auth valid?", preventing the timer from racing against auth resolution.
Also raise the default timeouts as a safety margin:
- Server handshake timeout: 3s → 10s
- Client challenge timeout: 2s → 10s (max 30s)
Fixes openclaw#46650
Fixes openclaw#48167
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Move clearHandshakeTimer() earlier in the gateway connect handler: right after validating the connect request, before any async auth work (device verification, token resolution, etc.)
Problem
On resource-constrained hosts (small VPS, busy Node.js event loop), the gateway async auth flow (resolveConnectAuthState, device signature verification, token grant) can take longer than the 3-second DEFAULT_HANDSHAKE_TIMEOUT_MS. Because clearHandshakeTimer() is only called after the entire auth flow completes, the timer fires mid-auth and kills a legitimate in-progress connection with code 1000.
The client sends a valid connect request within ~50ms of WS open, and the gateway receives it and starts processing auth. The handshake timer then fires 3 seconds after WS open and closes the connection before auth finishes.
Gateway logs show:
Client sees:
Root cause
The handshake timer starts on WS open and races against the full async auth resolution. When auth takes >3s (due to event loop pressure, slow I/O, or heavy GC), the timer wins.
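The race can be illustrated with a small timeline model. This is only a sketch of the pre-fix behaviour as described above, not the gateway's actual code: `outcomeBeforeFix` and the `Outcome` type are hypothetical names, and only `DEFAULT_HANDSHAKE_TIMEOUT_MS` (with its pre-fix value of 3s) comes from this PR.

```typescript
// Pre-fix value taken from this PR; everything else here is illustrative.
const DEFAULT_HANDSHAKE_TIMEOUT_MS = 3_000;

type Outcome = "authenticated" | "closed-by-handshake-timer";

/**
 * Hypothetical model of the pre-fix ordering: the timer starts at WS open
 * (t = 0) and is cleared only once the async auth flow has finished.
 */
function outcomeBeforeFix(
  connectReceivedAtMs: number,
  authDurationMs: number,
  timeoutMs: number = DEFAULT_HANDSHAKE_TIMEOUT_MS,
): Outcome {
  const authDoneAtMs = connectReceivedAtMs + authDurationMs;
  // Whichever event happens first decides the fate of the connection.
  return authDoneAtMs < timeoutMs ? "authenticated" : "closed-by-handshake-timer";
}

// Connect request arrives 50ms after open; auth is fast on a healthy host.
console.log(outcomeBeforeFix(50, 500));
// Same request, but auth takes 3.5s under event loop pressure: the timer wins.
console.log(outcomeBeforeFix(50, 3_500));
```

In this model the client did everything right in both cases; only the auth duration on the server side changes the result.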
Fix
Separate "did the client send a valid connect request?" from "is the auth valid?". Clear the timer as soon as the connect request parses and validates, before entering async auth.
If auth fails, the connection is closed by the auth handler itself (with a proper error code), not by the handshake timer.
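The ordering described in this section might look roughly like the sketch below. It is not the code from message-handler.ts: `ConnectionCtx`, `isValidConnectRequest`, `handleFirstMessage`, and `resolveAuth` are hypothetical stand-ins, and close code 1008 (policy violation) is an assumed example of a "proper error code".

```typescript
// Hypothetical per-connection context; the real handler's shape may differ.
interface ConnectionCtx {
  clearHandshakeTimer(): void;
  close(code: number, reason: string): void;
  // Stand-in for the async auth flow (device verification, token resolution, ...).
  resolveAuth(params: Record<string, unknown>): Promise<boolean>;
}

interface ConnectRequest {
  method: "connect";
  params: Record<string, unknown>;
}

// Cheap, synchronous shape check: method == connect and params is an object.
function isValidConnectRequest(msg: unknown): msg is ConnectRequest {
  if (typeof msg !== "object" || msg === null) return false;
  const m = msg as { method?: unknown; params?: unknown };
  return m.method === "connect" && typeof m.params === "object" && m.params !== null;
}

async function handleFirstMessage(
  ctx: ConnectionCtx,
  raw: string,
): Promise<"ok" | "rejected" | "invalid"> {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return "invalid"; // timer stays armed and will close the socket
  }
  if (!isValidConnectRequest(msg)) return "invalid"; // timer stays armed

  // The client has done its part: stop the handshake timer before any await.
  ctx.clearHandshakeTimer();

  const authorized = await ctx.resolveAuth(msg.params);
  if (!authorized) {
    // Auth failures are closed here, not by the handshake timer.
    ctx.close(1008, "unauthorized"); // assumed code, for illustration only
    return "rejected";
  }
  return "ok";
}
```

Because `clearHandshakeTimer()` runs before the first `await`, a slow auth flow can no longer be interrupted by the timer, while a client that never sends a valid connect request is still cut off by it.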
Files changed
- src/gateway/server/ws-connection/message-handler.ts: move clearHandshakeTimer() from post-auth to pre-auth
- src/gateway/server-constants.ts: DEFAULT_HANDSHAKE_TIMEOUT_MS: 3s -> 10s
- src/gateway/client.ts
Test plan
- openclaw gateway health succeeds on a resource-constrained host (2GB VPS, loopback + token auth)
- openclaw cron list succeeds (requires operator.read scope via device identity)
Fixes openclaw#46650
Fixes openclaw#48167