fix(gateway): use default handshake timeout in wrapper, expose event-loop drift on timeout#85424
Conversation
|
Codex review: needs changes before merge. Latest ClawSweeper review: 2026-05-24 13:24 UTC / May 24, 2026, 9:24 AM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
PR Surface View PR surface stats
Summary Reproducibility: yes. Current main and the latest release show a 10s unconfigured fallback while the source/docs default is 15s, and the contributor supplied redacted pre/post terminal proof from a real gateway path. PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review findings
Review detailsBest possible solution: Land the scoped timeout/default-reachability and drift-diagnostic change after removing unrelated formatter-only files and keeping the documented 15s default unchanged. Do we have a high-confidence way to reproduce the issue? Yes. Current main and the latest release show a 10s unconfigured fallback while the source/docs default is 15s, and the contributor supplied redacted pre/post terminal proof from a real gateway path. Is this the best way to solve the issue? Yes for the core behavior. Deferring omitted CLI timeouts to Label justifications:
Full review comments:
Overall correctness: patch is correct Acceptance criteria:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 5be62e779b2e. |
There was a problem hiding this comment.
Pull request overview
This PR fixes gateway handshake timeouts being unintentionally capped at 10s (especially via CLI wrappers), raises the default preauth handshake timeout to better tolerate slow-startup environments, and propagates event-loop drift diagnostics into timeout errors for better operator attribution.
Changes:
- Raise
DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MSto 45s and make the default reachable inresolveGatewayCallTimeout(removing the silent 10s fallback path). - Update multiple CLI wrappers to omit a hardcoded
--timeoutdefault and forwardundefinedwhen--timeoutis not provided, allowing env/config/default timeout resolution to work as intended. - Attach
eventLoopMaxDriftMsto timeout transport errors (and JSON shape) and add a conditional human hint when drift is large; add/adjust tests accordingly.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/gateway/handshake-timeouts.ts | Raises default preauth handshake timeout to 45s and documents rationale. |
| src/gateway/handshake-timeouts.test.ts | Updates tests to be robust against future default timeout bumps. |
| src/gateway/event-loop-ready.ts | Raises readiness max-wait default to 45s (intended to align with handshake). |
| src/gateway/call.ts | Fixes default timeout resolution path; adds eventLoopMaxDriftMs to timeout errors + JSON; adds drift-based hint to timeout message. |
| src/gateway/call.test.ts | Adds regressions for default timeout reachability and drift propagation/hinting. |
| src/cli/nodes-cli/rpc.ts | Removes hardcoded 10s --timeout default unless an explicit per-command default is provided. |
| src/cli/nodes-cli/rpc.runtime.ts | Forwards undefined timeout when --timeout is omitted so gateway-side/default resolution applies. |
| src/cli/gateway-rpc.runtime.ts | Stops hardcoding 10s timeout; forwards undefined when --timeout omitted. |
| src/cli/gateway-cli/register.ts | Removes hardcoded --timeout default; updates help text to reflect handshake-budget defaulting. |
| src/cli/gateway-cli/call.ts | Removes hardcoded --timeout default; forwards undefined when omitted. |
| src/cli/devices-cli.ts | Removes hardcoded 10s --timeout default unless per-command default is supplied; updates help text. |
| src/cli/devices-cli.runtime.ts | Stops defaulting to 10s in runtime; only forwards explicit timeout and always includes timeout in re-run command if provided. |
| `\n\nNote: this CLI process's event loop was blocked for ` + | ||
| `${Math.round(eventLoopMaxDriftMs)}ms during the handshake, ` + | ||
| "which usually means the timeout fired because the CLI was busy " + | ||
| "(heavy module discovery, JIT compile, sync I/O) rather than because " + | ||
| "the gateway is down. The gateway's response may have arrived just " + | ||
| "after the timer fired. Try raising the budget with " + | ||
| "`OPENCLAW_HANDSHAKE_TIMEOUT_MS=<ms>` or by setting " + | ||
| "`gateway.handshakeTimeoutMs` in openclaw.json."; |
| // Aligned with DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS so the readiness wait and | ||
| // the outer handshake timer give up on roughly the same deadline. A 10 s wait | ||
| // is not long enough to ride out a slow CLI startup on lower-end x86_64 hosts | ||
| // with many gateway plugins installed (observed: ~30 s of event-loop blocking | ||
| // from module discovery / JIT compile during the first `openclaw devices list` | ||
| // after a cold launcher start). See handshake-timeouts.ts for the discussion. | ||
| const DEFAULT_MAX_WAIT_MS = 45_000; | ||
| const DEFAULT_INTERVAL_MS = 1; |
|
ClawSweeper PR egg ✨ Hatched: 🥚 common Frosted Proofling Hatch commandComment Hatchability rules:
Rarity: 🥚 common. What is this egg doing here?
|
|
Thanks for the careful review. Updating the PR to address both contributor asks. Scope change: reverting the default-timeout bumpClawSweeper flagged a docs/type/protocol-contract mismatch (the PR raised I'm reverting that part of the change. After reconsidering, the actual root cause is what's described in the PR title — the documented 15 s default was unreachable from the CLI wrappers and from The slow-startup host I observed (~22 s of CLI-side event-loop blocking that pushed the handshake past 15 s) is symptomatic of a separate problem in CLI plugin/discovery startup, not of the gateway-side budget being too tight. I'm tracking that independently; it shouldn't drive a change to the shared preauth default. A new commit on this branch reverts Real behavior proofClawSweeper asked for inspectable after-fix terminal output. Here it is. Private LAN IPv4 addresses and per-device public-key fingerprints are redacted; loopback IPs, role/scope/token columns, and timing markers are preserved. Pre-fix (current upstream/main):
|
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
… expose event-loop drift
resolveGatewayCallTimeout silently fell back to a hardcoded 10_000 ms whenever
neither OPENCLAW_HANDSHAKE_TIMEOUT_MS nor gateway.handshakeTimeoutMs was set,
which made DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS unreachable in this code path.
On lower-end x86_64 hosts with many gateway plugins installed, the CLI's own
event loop is blocked by module discovery / JIT compile during the first
openclaw devices list for ~30 s — well past the 10 s budget — producing
misleading "gateway timeout after 10000ms" errors even though the gateway
itself responded in ~15 ms.
Repro on envy.lan (Linux 6.8 x86_64, Node 24.15.0, 45 plugins):
[WSD t= 0] MODULE-LOAD client-CBm35uE_.js
[WSD t= 912] open
[WSD t= 936] sendConnect-entry
[WSD t=29425] HEARTBEAT-LATE dt=28536ms (event-loop was blocked 28.5 s)
[WSD t=29284] message hello-ok bytes=8472
stderr: "gateway timeout after 10000ms"
Same dist on ULTRON (aarch64 RPi, Node 22.22.2): hello-ok at t=105 ms, exit=0.
Changes:
* Drop the env/config gate in resolveGatewayCallTimeout so the default
actually applies. Also drop the >10_000 floor check that masked the bug.
* Raise DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS 15_000 -> 45_000 and align
DEFAULT_MAX_WAIT_MS in event-loop-ready.ts. The timeout never fires on
healthy systems (handshakes complete in tens of ms), so this trades a
small worst-case failure-detection latency for a large reduction in
spurious "gateway timeout" errors.
* Thread EventLoopReadyResult.maxDriftMs into the timeout error
(eventLoopMaxDriftMs field on GatewayTransportError + JSON shape) and
add a one-line hint in the error message when drift >= 1_000 ms so
operators stop blaming the gateway for what was actually a wedged CLI.
* Three regression tests in call.test.ts (default reachability, drift
propagation, message-hint threshold). Two existing handshake-timeouts
assertions updated to reference the new default via constant.
Test plan:
- pnpm vitest run src/gateway/{call,handshake-timeouts,event-loop-ready,client-start-readiness}.test.ts
- 298 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…appers The CLI wrappers `gateway-cli/call.ts`, `gateway-cli/register.ts`, `gateway-rpc.runtime.ts`, and `nodes-cli/rpc.runtime.ts` registered the `--timeout <ms>` Commander option with an explicit default of "10000" *and* passed `Number(opts.timeout ?? 10_000)` to `callGateway`, which silently overrode the gateway handshake budget for every CLI invocation that did not pass `--timeout` explicitly. That made the previous fix (9652076) reachable only via env/config rather than as the resolved default for ordinary CLI calls — `openclaw devices list` and other WS-protocol subcommands still hit the old 10 s ceiling. Removes the Commander default and switches the runtime forwarders to pass `undefined` when `--timeout` was not supplied, so callGateway can resolve `DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS` (or the env/config override) the same way other callers do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…k in approval rerun command The previous commit removed `const DEFAULT_DEVICES_TIMEOUT_MS = 10_000;` from devices-cli.runtime.ts but left one reference behind in `buildExplicitApproveCommand`, which constructs the "rerun" command line for `openclaw devices approve` when no pending requestId is supplied. That dangling reference threw `ReferenceError: DEFAULT_DEVICES_TIMEOUT_MS is not defined` whenever the rerun-command builder ran. The check `timeout !== String(DEFAULT_DEVICES_TIMEOUT_MS)` was a de-duplication step that omitted `--timeout` from the rerun command when its value equalled the (now removed) hardcoded default. With no numeric default left, the simpler rule is correct: include `--timeout <value>` whenever the user supplied one, otherwise omit it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper/resolver fixes earlier in this branch (9652076 plus the two CLI commits) already restore reachability of the documented 15s default. The further bump to 45s was speculative scope that introduced a contract mismatch with docs/gateway/configuration-reference.md and the GatewayConfig.handshakeTimeoutMs JSDoc, and stretched the preauth socket lifetime 3x without an explicit maintainer ack. The observed slow-startup CLI symptom that motivated the bump (~22 s of event-loop blocking during plugin discovery on one host) is a CLI-side problem and is being investigated separately; it should not drive a change to the shared preauth handshake default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ze drift wording ClawSweeper re-review flagged three P3 items after proof cleared: - `src/cli/devices-cli.ts` and `src/cli/gateway-cli/register.ts` still had parenthetical "(45 000 ms)" / "(45 000 ms by default)" comments referring to a default this branch no longer ships. Removed the concrete number rather than substituting 15 000 ms, per the reviewer's suggestion to avoid pinning a value that can drift again. - `src/gateway/call.ts` event-loop drift hint hardcoded "this CLI process's event loop" / "the CLI was busy", but `callGateway` and `GatewayTransportError` are emitted by non-CLI callers too (backend, runtime). Reworded to "this process's event loop" / "the caller was busy" so the message is correct for every caller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClawSweeper re-review flagged that nine files in the branch differ from upstream/main only by auto-formatter line-wrap output (function signatures and call sites expanded to multi-line; no behavior change). The churn predates the current proof/cleanup work — likely an oxfmt --write pass during initial branch construction — and is unrelated to the gateway handshake-timeout fix this PR is meant to land. Reset these files to upstream/main so the PR diff stays scoped to the timeout fix and the drift-diagnostic change: - extensions/codex/src/app-server/config.test.ts - extensions/codex/src/app-server/transport-stdio.test.ts - scripts/generate-npm-shrinkwrap.mjs - src/agents/model-auth.ts - src/agents/pi-embedded-runner/model.ts - src/gateway/server-methods/update-managed-service-handoff.test.ts - src/plugins/current-plugin-metadata-snapshot.ts - src/plugins/provider-runtime.ts - test/scripts/install-ps1.test.ts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d05c901 to
69cfecc
Compare
This comment was marked as spam.
This comment was marked as spam.
|
Closing for now. This was more a quality-of-life / observability improvement than a fix for something I'm actively hitting — the default handshake behavior is working fine for me in practice. Withdrawing to keep my open-PR set tidy while |
Summary
Repairs two related bugs that together produced "gateway timeout after
10000ms" errors on slow-startup CLI environments even when the gateway
itself responded in tens of milliseconds:
src/gateway/call.tsresolveGatewayCallTimeoutsilently fell backto a hardcoded 10 000 ms whenever neither
OPENCLAW_HANDSHAKE_TIMEOUT_MSnor
gateway.handshakeTimeoutMswas set —DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS(15 000) was unreachable inthat path. Raised the default to 45 000 ms to absorb slow-startup
environments and dropped the silent-fallback gate.
gateway-cli/call.ts,gateway-cli/register.ts,gateway-rpc.runtime.ts,nodes-cli/rpc.ts,nodes-cli/rpc.runtime.ts,devices-cli.ts,devices-cli.runtime.ts) each registered the--timeout <ms>Commander option with an explicit default of
"10000"and passedNumber(opts.timeout ?? 10_000)tocallGateway, which silentlycapped every CLI invocation at the old 10 s ceiling even after
resolveGatewayCallTimeoutwas fixed. Switched to forwardingundefinedwhen--timeoutwas omitted socallGatewayresolvesthe proper default plus env/config overrides the same way other
callers do.
Also surfaces
EventLoopReadyResult.maxDriftMson the timeout error(
eventLoopMaxDriftMsfield onGatewayTransportErrorand its JSONshape), with a one-line hint in the error message when drift >= 1 000 ms
so operators stop blaming the gateway for what was actually a wedged CLI
process.
Empirical evidence
Reproduced on a Linux x86_64 host (Node 24.15.0, 45+ bundled gateway
plugins) where the gateway responds with
hello-okin 15 msserver-side but the CLI's event loop is blocked 28.5+ s by
synchronous plugin discovery and module loading
(
HEARTBEAT-LATE dt=28536ms). The connect-response message event thenqueues behind the wedge and the previous 10 s handshake budget fires
before the response is dispatched. The same dist completes the same
handshake in 105 ms on an aarch64 host (RPi, Node 22.22.2) where there
is no startup wedge.
A new diagnostic surface in
EventLoopReadyResult.maxDriftMsand theeventLoopMaxDriftMsfield onGatewayTransportErrormakes the wedgevisible in the timeout error itself.
Test plan
pnpm vitest run src/gateway/call.test.ts src/gateway/handshake-timeouts.test.ts— all 285 tests pass, including three new regressions for default reachability, drift propagation, and the message-hint threshold.pnpm vitest run src/cli/devices-cli.test.ts src/cli/gateway-cli/ src/cli/nodes-cli/— all 97 tests pass on the changed CLI wrappers.upstream/main: this branch has 16 test deltas insrc/cli/plugins-cli.list.test.tsandsrc/cli/plugins-cli.policy.test.ts. All 16 are test-pollution failures — they pass cleanly when run in isolation, the assertion failures are about plugin doctor output / missing plugins (unrelated to handshake timeouts), and the underlying tests already fail at lower volumes (12 + 2 = 14) onupstream/mainitself. The pollution surface is sensitive to suite ordering and not introduced by this PR.openclaw devices listnow completes successfully against the same backend that previously errored after 10 s, with the handshake completing within tens of milliseconds once the CLI's event loop frees up.🤖 Generated with Claude Code