
fix(gateway): use default handshake timeout in wrapper, expose event-loop drift on timeout #85424

Closed
ScientificProgrammer wants to merge 6 commits into openclaw:main from ScientificProgrammer:fix/gateway-handshake-timeout-slow-startup

Conversation

@ScientificProgrammer

Contributor

Summary

Repairs two related bugs that together produced "gateway timeout after
10000ms" errors on slow-startup CLI environments even when the gateway
itself responded in tens of milliseconds:

  1. src/gateway/call.ts resolveGatewayCallTimeout silently fell back
    to a hardcoded 10 000 ms whenever neither OPENCLAW_HANDSHAKE_TIMEOUT_MS
    nor gateway.handshakeTimeoutMs was set —
    DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS (15 000) was unreachable in
    that path. Raised the default to 45 000 ms to absorb slow-startup
    environments and dropped the silent-fallback gate.
  2. Seven CLI wrappers (gateway-cli/call.ts, gateway-cli/register.ts,
    gateway-rpc.runtime.ts, nodes-cli/rpc.ts,
    nodes-cli/rpc.runtime.ts, devices-cli.ts,
    devices-cli.runtime.ts) each registered the --timeout <ms>
    Commander option with an explicit default of "10000" and passed
    Number(opts.timeout ?? 10_000) to callGateway, which silently
    capped every CLI invocation at the old 10 s ceiling even after
    resolveGatewayCallTimeout was fixed. Switched to forwarding
    undefined when --timeout was omitted so callGateway resolves
    the proper default plus env/config overrides the same way other
    callers do.
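
A minimal sketch of the resolution order described in the two items above. The helper names (`resolveTimeoutMs`, `asPositiveMs`, `TimeoutSources`) are hypothetical, and the precedence of env over config is an assumption made for illustration; the real logic lives in resolveGatewayCallTimeout in src/gateway/call.ts. The constant's value here is illustrative only.

```typescript
// Hypothetical sketch, not the PR's code. Names and precedence are assumptions.
const DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS = 15_000; // illustrative value

interface TimeoutSources {
  explicitMs?: number;                      // e.g. a parsed --timeout value
  env?: Record<string, string | undefined>; // e.g. process.env
  configMs?: number;                        // e.g. gateway.handshakeTimeoutMs
}

// Accepts a number or numeric string; rejects NaN, zero, and negatives.
function asPositiveMs(value: unknown): number | undefined {
  const n = typeof value === "string" ? Number(value) : value;
  return typeof n === "number" && Number.isFinite(n) && n > 0 ? n : undefined;
}

// The first usable source wins; the shared default is always reachable.
function resolveTimeoutMs(sources: TimeoutSources): number {
  return (
    asPositiveMs(sources.explicitMs) ??
    asPositiveMs(sources.env?.OPENCLAW_HANDSHAKE_TIMEOUT_MS) ??
    asPositiveMs(sources.configMs) ??
    DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS
  );
}
```

The point of the shape is that no branch returns a separate hardcoded number: when nothing is configured, the shared constant is the only fallback.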

Also surfaces EventLoopReadyResult.maxDriftMs on the timeout error
(eventLoopMaxDriftMs field on GatewayTransportError and its JSON
shape), with a one-line hint in the error message when drift >= 1 000 ms
so operators stop blaming the gateway for what was actually a wedged CLI
process.
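
As a rough illustration of the drift hint, the sketch below appends a note to the timeout message only when the measured drift reaches the 1 000 ms threshold. `formatTimeoutMessage` is a hypothetical helper and the hint wording is a placeholder, not the text shipped in src/gateway/call.ts.

```typescript
// Threshold taken from the description above; everything else is hypothetical.
const DRIFT_HINT_THRESHOLD_MS = 1_000;

function formatTimeoutMessage(timeoutMs: number, eventLoopMaxDriftMs?: number): string {
  const base = `gateway timeout after ${timeoutMs}ms`;
  // No hint when drift was not measured or stayed below the threshold.
  if (eventLoopMaxDriftMs === undefined || eventLoopMaxDriftMs < DRIFT_HINT_THRESHOLD_MS) {
    return base;
  }
  return (
    `${base} (note: this process's event loop was blocked for ` +
    `${Math.round(eventLoopMaxDriftMs)}ms during the handshake)`
  );
}
```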

Empirical evidence

Reproduced on a Linux x86_64 host (Node 24.15.0, 45+ bundled gateway
plugins) where the gateway responds with hello-ok in 15 ms
server-side but the CLI's event loop is blocked 28.5+ s by
synchronous plugin discovery and module loading
(HEARTBEAT-LATE dt=28536ms). The connect-response message event then
queues behind the wedge and the previous 10 s handshake budget fires
before the response is dispatched. The same dist completes the same
handshake in 105 ms on an aarch64 host (RPi, Node 22.22.2) where there
is no startup wedge.
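
The mechanism can be reproduced in miniature. The sketch below is a standalone illustration with hypothetical function names, not code from this PR: it schedules a short timer, then blocks the event loop synchronously, and reports how late the timer fired. The busy-wait stands in for synchronous plugin discovery.

```typescript
// Resolves with how many ms late the timer fired relative to its requested delay.
function measureTimerDrift(delayMs: number): Promise<number> {
  const scheduledAt = Date.now();
  return new Promise<number>((resolve) => {
    setTimeout(() => resolve(Date.now() - scheduledAt - delayMs), delayMs);
  });
}

// Synchronously occupies the event loop for roughly `ms` milliseconds.
function blockEventLoop(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy-wait: no timers or socket events can be dispatched here
  }
}

async function demo(): Promise<number> {
  const drift = measureTimerDrift(10); // timer is due in 10 ms
  blockEventLoop(200);                 // the loop is wedged for 200 ms
  return drift;                        // the callback can only run after the wedge ends
}
```

A queued WebSocket message event is delayed by the wedge in the same way as this timer callback, which is how a response that arrived early can still be dispatched after the handshake budget has expired.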

A new diagnostic surface in EventLoopReadyResult.maxDriftMs and the
eventLoopMaxDriftMs field on GatewayTransportError makes the wedge
visible in the timeout error itself.

Test plan

  • pnpm vitest run src/gateway/call.test.ts src/gateway/handshake-timeouts.test.ts — all 285 tests pass, including three new regressions for default reachability, drift propagation, and the message-hint threshold.
  • pnpm vitest run src/cli/devices-cli.test.ts src/cli/gateway-cli/ src/cli/nodes-cli/ — all 97 tests pass on the changed CLI wrappers.
  • Comparison with upstream/main: this branch has 16 test deltas in src/cli/plugins-cli.list.test.ts and src/cli/plugins-cli.policy.test.ts. All 16 are test-pollution failures — they pass cleanly when run in isolation, the assertion failures are about plugin doctor output / missing plugins (unrelated to handshake timeouts), and the underlying tests already fail at lower volumes (12 + 2 = 14) on upstream/main itself. The pollution surface is sensitive to suite ordering and not introduced by this PR.
  • End-to-end on a slow-startup environment: openclaw devices list now completes successfully against the same backend that previously errored after 10 s, with the handshake completing within tens of milliseconds once the CLI's event loop frees up.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 22, 2026 15:38
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime cli CLI command changes size: M labels May 22, 2026
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 22, 2026
@clawsweeper

clawsweeper Bot commented May 22, 2026

Contributor

Codex review: needs changes before merge.

Latest ClawSweeper review: 2026-05-24 13:24 UTC / May 24, 2026, 9:24 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

PR Surface
Source +94, Tests +114, Other +2. Total +210 across 14 files.

View PR surface stats
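| Area      | Files | Added | Removed | Net  |
|-----------|-------|-------|---------|------|
| Source    | 10    | 150   | 56      | +94  |
| Tests     | 3     | 125   | 11      | +114 |
| Docs      | 0     | 0     | 0       | 0    |
| Config    | 0     | 0     | 0       | 0    |
| Generated | 0     | 0     | 0       | 0    |
| Other     | 1     | 3     | 1       | +2   |
| Total     | 14    | 278   | 68      | +210 |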

Summary
The PR changes gateway call timeout resolution and CLI wrappers so omitted --timeout uses the shared handshake budget, and adds event-loop drift metadata to gateway timeout errors with regression tests.

Reproducibility: yes. Current main and the latest release show a 10s unconfigured fallback while the source/docs default is 15s, and the contributor supplied redacted pre/post terminal proof from a real gateway path.

PR rating
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Summary: Strong real terminal proof and a focused core fix, with ordinary readiness reduced by remaining unrelated diff cleanup and stale proof-check metadata.

Rank-up moves:

  • Reset the unrelated formatter-only changes outside the gateway timeout surface.
  • Update the PR body to remove the reverted 45s default claim, and include the proof in the structured proof section if the required check remains red.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Sufficient (terminal): The contributor posted redacted terminal proof showing the pre-fix 10000ms timeout and the post-fix openclaw devices list completing against the same gateway path.

Risk before merge

  • The cumulative GitHub diff still includes unrelated line-wrapping in scripts, agent runtime, and gateway handoff test files, broadening review/ownership for a gateway timeout fix.
  • The latest head has a failing Real behavior proof check because the proof is in a comment rather than the parsed PR-body section, even though the comment evidence is sufficient for this review.
  • I did not execute the Vitest lanes or live slow-startup repro in this read-only pass; validation here relies on source inspection, diff review, CI metadata, history, and contributor terminal proof.

Maintainer options:

  1. Decide the mitigation before merge
    Land the scoped timeout/default-reachability and drift-diagnostic change after removing unrelated formatter-only files and keeping the documented 15s default unchanged.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge
A narrow repair can drop the remaining unrelated formatter-only files; the core gateway behavior does not need an automated code repair.

Security
Cleared: No concrete security or supply-chain regression was found; the supply-chain script touch is line-wrapping only and should be removed as unrelated scope churn.

Review findings

  • [P3] Drop unrelated formatter-only files — src/agents/model-auth.ts:862
Review details

Best possible solution:

Land the scoped timeout/default-reachability and drift-diagnostic change after removing unrelated formatter-only files and keeping the documented 15s default unchanged.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main and the latest release show a 10s unconfigured fallback while the source/docs default is 15s, and the contributor supplied redacted pre/post terminal proof from a real gateway path.

Is this the best way to solve the issue?

Yes for the core behavior. Deferring omitted CLI timeouts to callGateway is the narrow maintainable fix, but the branch should drop unrelated formatting churn before merge.

Label justifications:

  • P2: This is a bounded gateway/CLI reliability fix with real user impact but limited blast radius and no emergency failure mode.
  • rating: 🐚 platinum hermit: Current PR rating is 🐚 platinum hermit because proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit. The PR has strong real terminal proof and a focused core fix, with ordinary readiness reduced by remaining unrelated diff cleanup and stale proof-check metadata.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The contributor posted redacted terminal proof showing the pre-fix 10000ms timeout and the post-fix openclaw devices list completing against the same gateway path.
  • proof: sufficient: Contributor real behavior proof is sufficient. The contributor posted redacted terminal proof showing the pre-fix 10000ms timeout and the post-fix openclaw devices list completing against the same gateway path.

Full review comments:

  • [P3] Drop unrelated formatter-only files — src/agents/model-auth.ts:862
    The branch still changes this agent auth line only by collapsing formatting, and the same formatter-only churn remains in scripts/generate-npm-shrinkwrap.mjs, src/agents/pi-embedded-runner/model.ts, and src/gateway/server-methods/update-managed-service-handoff.test.ts. Those files are unrelated to the gateway timeout fix and broaden the merge surface; please reset them before merge.
    Confidence: 0.93

Overall correctness: patch is correct
Overall confidence: 0.86

Acceptance criteria:

  • git diff --check
  • node scripts/run-vitest.mjs src/gateway/call.test.ts src/gateway/handshake-timeouts.test.ts
  • node scripts/run-vitest.mjs src/cli/devices-cli.test.ts src/cli/gateway-cli src/cli/nodes-cli

What I checked:

Likely related people:

  • steipete: Current-main blame attributes the timeout resolver, handshake timeout constants, and the current CLI wrapper defaults to recent gateway/CLI work; older history also shows gateway timeout wiring and devices CLI authorship by Peter Steinberger. (role: recent area contributor; confidence: high; commits: b972ac194042, f5408d82d210, d88b239d3c8a; files: src/gateway/call.ts, src/gateway/handshake-timeouts.ts, src/cli/devices-cli.ts)
  • Shakker: Prior merged CLI lazy-runtime refactors are directly adjacent to the gateway and nodes runtime wrapper paths changed here. (role: runtime refactor author; confidence: medium; commits: 23422ccb6842, 36c82827950f; files: src/cli/gateway-rpc.runtime.ts, src/cli/nodes-cli/rpc.runtime.ts, src/cli/nodes-cli/rpc.ts)
  • BradGroux: Earlier gateway handshake reliability work for slow-startup environments is closely related to the timeout/drift diagnostic behavior in this PR. (role: prior slow-startup handshake contributor; confidence: medium; commits: 6e94b047e2da; files: src/gateway/call.ts, src/gateway/handshake-timeouts.ts)
  • coygeek: Recent devices approval behavior and tests are near the devices timeout wrapper path changed by the PR. (role: recent devices CLI contributor; confidence: medium; commits: 192ee081e77e; files: src/cli/devices-cli.ts, src/cli/devices-cli.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5be62e779b2e.

Copilot AI left a comment

Contributor

Pull request overview

This PR fixes gateway handshake timeouts being unintentionally capped at 10s (especially via CLI wrappers), raises the default preauth handshake timeout to better tolerate slow-startup environments, and propagates event-loop drift diagnostics into timeout errors for better operator attribution.

Changes:

  • Raise DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS to 45s and make the default reachable in resolveGatewayCallTimeout (removing the silent 10s fallback path).
  • Update multiple CLI wrappers to omit a hardcoded --timeout default and forward undefined when --timeout is not provided, allowing env/config/default timeout resolution to work as intended.
  • Attach eventLoopMaxDriftMs to timeout transport errors (and JSON shape) and add a conditional human hint when drift is large; add/adjust tests accordingly.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
|------|-------------|
| src/gateway/handshake-timeouts.ts | Raises default preauth handshake timeout to 45s and documents rationale. |
| src/gateway/handshake-timeouts.test.ts | Updates tests to be robust against future default timeout bumps. |
| src/gateway/event-loop-ready.ts | Raises readiness max-wait default to 45s (intended to align with handshake). |
| src/gateway/call.ts | Fixes default timeout resolution path; adds eventLoopMaxDriftMs to timeout errors + JSON; adds drift-based hint to timeout message. |
| src/gateway/call.test.ts | Adds regressions for default timeout reachability and drift propagation/hinting. |
| src/cli/nodes-cli/rpc.ts | Removes hardcoded 10s --timeout default unless an explicit per-command default is provided. |
| src/cli/nodes-cli/rpc.runtime.ts | Forwards undefined timeout when --timeout is omitted so gateway-side/default resolution applies. |
| src/cli/gateway-rpc.runtime.ts | Stops hardcoding 10s timeout; forwards undefined when --timeout omitted. |
| src/cli/gateway-cli/register.ts | Removes hardcoded --timeout default; updates help text to reflect handshake-budget defaulting. |
| src/cli/gateway-cli/call.ts | Removes hardcoded --timeout default; forwards undefined when omitted. |
| src/cli/devices-cli.ts | Removes hardcoded 10s --timeout default unless per-command default is supplied; updates help text. |
| src/cli/devices-cli.runtime.ts | Stops defaulting to 10s in runtime; only forwards explicit timeout and always includes timeout in re-run command if provided. |

Comment thread src/gateway/call.ts Outdated
Comment on lines +617 to +624
`\n\nNote: this CLI process's event loop was blocked for ` +
`${Math.round(eventLoopMaxDriftMs)}ms during the handshake, ` +
"which usually means the timeout fired because the CLI was busy " +
"(heavy module discovery, JIT compile, sync I/O) rather than because " +
"the gateway is down. The gateway's response may have arrived just " +
"after the timer fired. Try raising the budget with " +
"`OPENCLAW_HANDSHAKE_TIMEOUT_MS=<ms>` or by setting " +
"`gateway.handshakeTimeoutMs` in openclaw.json.";
Comment thread src/gateway/event-loop-ready.ts Outdated
Comment on lines 19 to 26
// Aligned with DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS so the readiness wait and
// the outer handshake timer give up on roughly the same deadline. A 10 s wait
// is not long enough to ride out a slow CLI startup on lower-end x86_64 hosts
// with many gateway plugins installed (observed: ~30 s of event-loop blocking
// from module discovery / JIT compile during the first `openclaw devices list`
// after a cold launcher start). See handshake-timeouts.ts for the discussion.
const DEFAULT_MAX_WAIT_MS = 45_000;
const DEFAULT_INTERVAL_MS = 1;
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 22, 2026
@clawsweeper

clawsweeper Bot commented May 22, 2026

Contributor

ClawSweeper PR egg

✨ Hatched: 🥚 common Frosted Proofling

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 🥚 common.
Trait: sleeps inside passing CI.
Image traits: location status garden; accessory commit compass; palette pearl, teal, and neon green; mood focused; pose standing beside its cracked shell; shell brushed metal shell; lighting cool dashboard glow; background tiny artifact crates.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@ScientificProgrammer

Contributor Author

Thanks for the careful review. Updating the PR to address both contributor asks.

Scope change: reverting the default-timeout bump

ClawSweeper flagged a docs/type/protocol-contract mismatch (the PR raised DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS from 15_000 to 45_000 without updating the public configuration reference, the GatewayConfig JSDoc, or the protocol-side default), plus asked maintainers to explicitly accept a 3× longer preauth socket lifetime.

I'm reverting that part of the change. After reconsidering, the actual root cause is what's described in the PR title — the documented 15 s default was unreachable from the CLI wrappers and from resolveGatewayCallTimeout's unconfigured path. The wrapper-and-resolver fix alone (fix(cli): drop hardcoded 10 s gateway timeout in CLI wrappers and the call-side change in fix(gateway): use default handshake timeout in wrapper, expose event-loop drift) restores reachability of the existing 15 s budget, which is what should have been the merge-target all along.

The slow-startup host I observed (~22 s of CLI-side event-loop blocking that pushed the handshake past 15 s) is symptomatic of a separate problem in CLI plugin/discovery startup, not of the gateway-side budget being too tight. I'm tracking that independently; it shouldn't drive a change to the shared preauth default.

A new commit on this branch reverts DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS to 15_000 and DEFAULT_MAX_WAIT_MS (in event-loop-ready.ts) back to its pre-PR 10_000. No documentation update is needed because the runtime value now matches the public configuration reference and the GatewayConfig.handshakeTimeoutMs JSDoc again.

Real behavior proof

ClawSweeper asked for inspectable after-fix terminal output. Here it is. Private LAN IPv4 addresses and per-device public-key fingerprints are redacted; loopback IPs, role/scope/token columns, and timing markers are preserved.

Pre-fix (current upstream/main): gateway timeout after 10000ms

Captured during the original investigation against the unpatched build (note the dist/call-DC9-4tbc.js path — this is the bundled CLI on the unpatched 2026.5.19 install):

…
NET 1927854: destroy / close                                    ← all sockets torn down
[openclaw] Reason: gateway timeout after 10000ms
[openclaw] GatewayTransportError: at createGatewayTimeoutTransportError (file:///…/dist/call-DC9-4tbc.js:273:9)
                                    at Timeout.<anonymous> (file:///…/dist/call-DC9-4tbc.js:358:9)

The 10000ms in the error message comes directly from the DEFAULT_DEVICES_TIMEOUT_MS = 1e4 constant in the bundled CLI — i.e., from the hardcoded CLI-wrapper timeout this PR removes, not from the DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS constant. That's exactly the wedge the wrapper fix targets.

Post-fix: openclaw devices list completes against the local gateway

Captured 2026-05-22 10:58 CDT on the same host, against an install that includes this PR's wrapper-and-resolver fix:

$ openclaw devices list
│
◇
Config warnings:
- plugins.entries.googlechat: plugin googlechat: duplicate plugin id detected; global plugin will be overridden by bundled plugin (/home/eric/.npm-global/lib/node_modules/openclaw/dist/extensions/googlechat/index.js)

OpenClaw 2026.5.19 (f25f3d7) — Shell yeah—I'm here to pinch the toil and leave you the glory.

[…config-warnings panel and a second identical plugins.entries.googlechat warning elided for brevity…]

10:55:19 [plugins] loading anthropic from /home/eric/.npm-global/lib/node_modules/openclaw/dist/extensions/anthropic/index.js
10:55:19 [plugins] loading byteplus from /home/eric/.npm-global/lib/node_modules/openclaw/dist/extensions/byteplus/index.js
[…6 more plugin loading lines elided…]
10:55:20 [plugins] loaded 8 plugin(s) (8 attempted) in 311.1ms
10:55:22 [plugins] loading anthropic from /home/eric/.npm-global/lib/node_modules/openclaw/dist/extensions/anthropic/index.js
[…7 more plugin loading lines elided — note this is a doubled load pass, separate scope…]
10:55:22 [plugins] loaded 8 plugin(s) (8 attempted) in 34.9ms
10:55:31 [codex/catalog] codex model discovery failed; using fallback catalog
◇
Paired (30)
┌──────────────────────────────────────────────────────────────────┬─────────────┬───────────────────────────…
│ Device                                                           │ Roles       │ Scopes                     …
├──────────────────────────────────────────────────────────────────┼─────────────┼───────────────────────────…
│ [redacted-device-fingerprint-01]                                 │ operator    │ operator.admin, operator.r…
│ Data                                                             │ operator,   │ operator.read, operator.wr…
│                                                                  │ node        │                            …
│ [redacted-device-fingerprint-02]                                 │ operator    │ operator.admin, operator.r…
[…28 more redacted device rows elided; full 30-device list demonstrated successful response…]
│ Android-Termux                                                   │ node        │                            …
│ [redacted-device-fingerprint-27]                                 │ operator    │ operator.admin, operator.a…
└──────────────────────────────────────────────────────────────────┴─────────────┴───────────────────────────…
$

Full unabridged redacted capture is preserved in my session artifacts; happy to attach as a gist if reviewers want every row.

What this demonstrates

  1. The CLI completes the WebSocket handshake against the local gateway and prints the 30-device paired list. Pre-fix this command emitted gateway timeout after 10000ms and a GatewayTransportError.
  2. The 9-second gap between 10:55:22 [plugins] loaded 8 plugin(s) and 10:55:31 [codex/catalog] shows the CLI does spend meaningful event-loop time on discovery work; that's what was tripping the (pre-fix, hardcoded) 10 s wrapper budget. With the wrapper fix, the documented 15 s default applies in this path and the handshake completes within it.

The duplicate-googlechat warning, the codex/catalog discovery failure, and the doubled plugin-loading pass visible above are unrelated; they're separate-scope items I'm tracking independently.

Copilot's two inline asks

Holding both for a focused follow-up so this PR can stay scoped to the wrapper-and-resolver fix:

  • src/gateway/call.ts:617 — the drift-hint message wording can be made client-neutral (or keyed off clientName/mode). Will land in a small follow-up PR.
  • src/gateway/event-loop-ready.ts:25 — partly addressed by this PR's revert (the literal 45_000 is gone; DEFAULT_MAX_WAIT_MS is back to its pre-PR 10_000 value and is no longer claiming alignment with the handshake default). If you'd still like the constant imported from handshake-timeouts.ts for forward-safety, I'll include it in the same follow-up.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 22, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P2 Normal backlog priority with limited blast radius. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 22, 2026
@ScientificProgrammer ScientificProgrammer requested a review from a team as a code owner May 22, 2026 19:04
@openclaw-barnacle openclaw-barnacle Bot added scripts Repository scripts agents Agent runtime and tooling extensions: codex triage: dirty-candidate Candidate: broad unrelated surfaces; may need splitting or cleanup. and removed scripts Repository scripts agents Agent runtime and tooling extensions: codex labels May 22, 2026
ScientificProgrammer and others added 3 commits May 24, 2026 08:05
… expose event-loop drift

resolveGatewayCallTimeout silently fell back to a hardcoded 10_000 ms whenever
neither OPENCLAW_HANDSHAKE_TIMEOUT_MS nor gateway.handshakeTimeoutMs was set,
which made DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS unreachable in this code path.
On lower-end x86_64 hosts with many gateway plugins installed, the CLI's own
event loop is blocked by module discovery / JIT compile during the first
openclaw devices list for ~30 s — well past the 10 s budget — producing
misleading "gateway timeout after 10000ms" errors even though the gateway
itself responded in ~15 ms.

Repro on envy.lan (Linux 6.8 x86_64, Node 24.15.0, 45 plugins):

  [WSD t=    0] MODULE-LOAD client-CBm35uE_.js
  [WSD t=  912] open
  [WSD t=  936] sendConnect-entry
  [WSD t=29425] HEARTBEAT-LATE dt=28536ms (event-loop was blocked 28.5 s)
  [WSD t=29284] message hello-ok bytes=8472
                                            stderr: "gateway timeout after 10000ms"

Same dist on ULTRON (aarch64 RPi, Node 22.22.2): hello-ok at t=105 ms, exit=0.

Changes:
  * Drop the env/config gate in resolveGatewayCallTimeout so the default
    actually applies. Also drop the >10_000 floor check that masked the bug.
  * Raise DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS 15_000 -> 45_000 and align
    DEFAULT_MAX_WAIT_MS in event-loop-ready.ts. The timeout never fires on
    healthy systems (handshakes complete in tens of ms), so this trades a
    small worst-case failure-detection latency for a large reduction in
    spurious "gateway timeout" errors.
  * Thread EventLoopReadyResult.maxDriftMs into the timeout error
    (eventLoopMaxDriftMs field on GatewayTransportError + JSON shape) and
    add a one-line hint in the error message when drift >= 1_000 ms so
    operators stop blaming the gateway for what was actually a wedged CLI.
  * Three regression tests in call.test.ts (default reachability, drift
    propagation, message-hint threshold). Two existing handshake-timeouts
    assertions updated to reference the new default via constant.

Test plan:
- pnpm vitest run src/gateway/{call,handshake-timeouts,event-loop-ready,client-start-readiness}.test.ts
  - 298 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…appers

The CLI wrappers `gateway-cli/call.ts`, `gateway-cli/register.ts`,
`gateway-rpc.runtime.ts`, and `nodes-cli/rpc.runtime.ts` registered the
`--timeout <ms>` Commander option with an explicit default of "10000"
*and* passed `Number(opts.timeout ?? 10_000)` to `callGateway`, which
silently overrode the gateway handshake budget for every CLI invocation
that did not pass `--timeout` explicitly. That made the previous fix
(9652076) reachable only via env/config rather than as the resolved
default for ordinary CLI calls — `openclaw devices list` and other
WS-protocol subcommands still hit the old 10 s ceiling.

Removes the Commander default and switches the runtime forwarders to
pass `undefined` when `--timeout` was not supplied, so callGateway can
resolve `DEFAULT_PREAUTH_HANDSHAKE_TIMEOUT_MS` (or the env/config
override) the same way other callers do.
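
A minimal sketch of the forwarding rule, for illustration only. `CliOpts` and `timeoutFromOpts` are hypothetical names, and throwing on a malformed value is an assumption rather than confirmed wrapper behavior.

```typescript
// Hypothetical option shape; Commander delivers option values as strings.
interface CliOpts {
  timeout?: string;
}

// Old pattern (for contrast): Number(opts.timeout ?? 10_000) always produced
// a number, so the callee could never tell that --timeout had been omitted.
function timeoutFromOpts(opts: CliOpts): number | undefined {
  if (opts.timeout === undefined) {
    return undefined; // omitted: let callGateway resolve env/config/default
  }
  const ms = Number(opts.timeout);
  if (!Number.isFinite(ms) || ms <= 0) {
    // Assumption: reject bad input instead of silently substituting a default.
    throw new Error(`invalid --timeout value: ${opts.timeout}`);
  }
  return ms;
}
```

This only works if the option is registered without a Commander default; otherwise `opts.timeout` is never `undefined` and the callee's own default stays unreachable.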

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…k in approval rerun command

The previous commit removed `const DEFAULT_DEVICES_TIMEOUT_MS = 10_000;`
from devices-cli.runtime.ts but left one reference behind in
`buildExplicitApproveCommand`, which constructs the "rerun" command line
for `openclaw devices approve` when no pending requestId is supplied.
That dangling reference threw `ReferenceError: DEFAULT_DEVICES_TIMEOUT_MS
is not defined` whenever the rerun-command builder ran.

The check `timeout !== String(DEFAULT_DEVICES_TIMEOUT_MS)` was a
de-duplication step that omitted `--timeout` from the rerun command when
its value equalled the (now removed) hardcoded default. With no numeric
default left, the simpler rule is correct: include `--timeout <value>`
whenever the user supplied one, otherwise omit it.
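
The simplified rule can be sketched as follows. `buildRerunCommand` and the exact command layout are hypothetical stand-ins; the real builder is `buildExplicitApproveCommand` in devices-cli.runtime.ts, whose signature is not shown here.

```typescript
// Hypothetical stand-in for the rerun-command builder.
function buildRerunCommand(requestId: string, timeout?: string): string {
  const parts = ["openclaw", "devices", "approve", requestId];
  // No numeric default to compare against: include the flag only if supplied.
  if (timeout !== undefined) {
    parts.push("--timeout", timeout);
  }
  return parts.join(" ");
}
```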

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ScientificProgrammer and others added 3 commits May 24, 2026 08:05
The wrapper/resolver fixes earlier in this branch (9652076 plus the
two CLI commits) already restore reachability of the documented 15s
default. The further bump to 45s was speculative scope that introduced
a contract mismatch with docs/gateway/configuration-reference.md and
the GatewayConfig.handshakeTimeoutMs JSDoc, and stretched the preauth
socket lifetime 3x without an explicit maintainer ack.

The observed slow-startup CLI symptom that motivated the bump
(~22 s of event-loop blocking during plugin discovery on one host) is
a CLI-side problem and is being investigated separately; it should not
drive a change to the shared preauth handshake default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ze drift wording

ClawSweeper re-review flagged three P3 items after proof cleared:

- `src/cli/devices-cli.ts` and `src/cli/gateway-cli/register.ts` still had
  parenthetical "(45 000 ms)" / "(45 000 ms by default)" comments referring
  to a default this branch no longer ships. Removed the concrete number
  rather than substituting 15 000 ms, per the reviewer's suggestion to
  avoid pinning a value that can drift again.
- `src/gateway/call.ts` event-loop drift hint hardcoded "this CLI process's
  event loop" / "the CLI was busy", but `callGateway` and
  `GatewayTransportError` are emitted by non-CLI callers too (backend,
  runtime). Reworded to "this process's event loop" / "the caller was busy"
  so the message is correct for every caller.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClawSweeper re-review flagged that nine files in the branch differ from
upstream/main only by auto-formatter line-wrap output (function signatures
and call sites expanded to multi-line; no behavior change). The churn
predates the current proof/cleanup work — likely an oxfmt --write pass
during initial branch construction — and is unrelated to the gateway
handshake-timeout fix this PR is meant to land.

Reset these files to upstream/main so the PR diff stays scoped to the
timeout fix and the drift-diagnostic change:

- extensions/codex/src/app-server/config.test.ts
- extensions/codex/src/app-server/transport-stdio.test.ts
- scripts/generate-npm-shrinkwrap.mjs
- src/agents/model-auth.ts
- src/agents/pi-embedded-runner/model.ts
- src/gateway/server-methods/update-managed-service-handoff.test.ts
- src/plugins/current-plugin-metadata-snapshot.ts
- src/plugins/provider-runtime.ts
- test/scripts/install-ps1.test.ts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ScientificProgrammer ScientificProgrammer force-pushed the fix/gateway-handshake-timeout-slow-startup branch from d05c901 to 69cfecc Compare May 24, 2026 13:18
@openclaw-barnacle openclaw-barnacle Bot added scripts Repository scripts agents Agent runtime and tooling and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 24, 2026
@BingqingLyu

This comment was marked as spam.

@ScientificProgrammer

Contributor Author

Closing for now. This was more a quality-of-life / observability improvement than a fix for something I'm actively hitting — the default handshake behavior is working fine for me in practice. Withdrawing to keep my open-PR set tidy while main moves; glad to revisit the event-loop-drift-on-timeout telemetry later if it'd be useful upstream.

Labels

agents Agent runtime and tooling cli CLI command changes gateway Gateway runtime P2 Normal backlog priority with limited blast radius. proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. scripts Repository scripts size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. triage: dirty-candidate Candidate: broad unrelated surfaces; may need splitting or cleanup.

3 participants