fix(gateway): tolerate transient pre-hello closes in CLI calls#54475
fix(gateway): tolerate transient pre-hello closes in CLI calls#54475ruanrrn wants to merge 5 commits into
Conversation
Greptile SummaryThis PR fixes a flaky CLI failure where one-shot gateway calls ( Changes:
Notes:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/gateway/client.ts
Line: 523-528
Comment:
**String-based error detection is fragile**
The transient close detection relies on matching `err.message` against a hardcoded regex that must stay in sync with the error message template two hundred lines away:
```ts
// line 296 – error construction
this.flushPendingErrors(new Error(`gateway closed (${code}): ${reasonText}`));
// line 526 – detection regex
/^gateway closed \(1000\):\s*$/.test(err.message)
```
If the template on line 296 is ever changed (e.g. adding a suffix, changing the separator, or including `reasonText` differently for the empty-string case), the regex would silently stop matching and CLI callers would revert to the pre-fix behaviour — failing immediately on transient pre-hello closes — without any test failure in unrelated paths. The new tests in `client.test.ts` do guard this specific message, but only for the test scenario as written.
A more robust alternative would be to use a structured sentinel (e.g. a typed error subclass `GatewayPreHelloCloseError`, or a tagged `{ isPreHelloClose: true }` field) rather than pattern-matching on a human-readable message string. This would be immune to message-copy changes and would not require the regex to be kept in sync.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(gateway): tolerate transient pre-hel..." | Re-trigger Greptile |
012cb0d to
af48d50
Compare
af48d50 to
29473d1
Compare
|
Thanks for the context here. I swept through the related work, and this is now duplicate or superseded. Keep this PR open: current main and the latest release still reject the narrow pre-hello clean-close path, and this branch remains the canonical implementation candidate, but it is not merge-ready because it conflicts with main and still lacks inspectable real behavior proof. Canonical path: Close this PR as superseded by #85253. So I’m closing this here and keeping the remaining discussion on #85253. Review detailsBest possible solution: Close this PR as superseded by #85253. Do we have a high-confidence way to reproduce the issue? Yes for the narrow event sequence by source inspection: current main rejects an unsettled Is this the best way to solve the issue? Yes if refreshed: the PR keeps the fix in the gateway call/client layer, scopes suppression to pre-hello code 1000 with an empty reason, and preserves fatal behavior for other close cases. It still needs conflict resolution, changed-surface validation, and inspectable real CLI proof before merge. Security review: Security review cleared: The diff is limited to gateway close/error handling, tests, and changelog text, with no dependency, workflow, permission, package, install, or secret-handling changes found. What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 6d9b3887ea26. |
|
ProjectClownfish pushed a narrow repair to this branch so the original contributor path can stay canonical. Source PR: #54475 |
29473d1 to
6f5aea4
Compare
6f5aea4 to
e089d02
Compare
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
|
ClawSweeper applied the proposed close for this PR.
|
Summary
openclaw cron status/openclaw cron list --json) can fail withgateway closed (1000):when the first WebSocket connection closes with code1000and an empty reason beforehello-ok, even though the client would immediately reconnect and complete the handshake successfully.executeGatewayRequestWithScopes(...)now tracks whetherhello-okhas been seen and ignores only a pre-hello close with code1000and an empty reason;GatewayClient.connecttreats that same transient connect error as non-fatal so the existing reconnect logic can complete; added focused tests covering both the one-shot call path and the client reconnect behavior.(AI-assisted: drafted and iterated with an OpenClaw workspace agent; changes and tests were reviewed and verified by a human.)
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
1000closes on CLI gateway callsRoot Cause / Regression History (if applicable)
executeGatewayRequestWithScopes) treated anyonCloseevent as terminal, even if it happened beforehello-ok. In some environments, the underlying WebSocket stack can emit a transient1000close with an empty reason during the initial handshake, whileGatewayClientreconnects and later completeshellosuccessfully.hello-okhad been seen and did not special-case this transient1000path.openclaw cron status/listwithgateway closed (1000):even though the gateway is reachable and the Control UI can talk to it.Regression Test Plan (if applicable)
src/gateway/call.test.ts: new test ensuringcallGateway({ method: "health" })waits through a transient pre-hello1000close and resolves successfully oncehello-okarrives.src/gateway/client.test.ts: new test ensuringGatewayClientsuppresses the spuriousgateway connect failed: ... (1000)log +onConnectErrorcallback when a pre-hello1000close is followed by a successful reconnect andhello-ok.1000+ empty-reason close should not fail early; it should wait for the reconnect and complete normally.onConnectErrorwhen the reconnect ultimately succeeds.GatewayClientconnect path. Unit-level tests against these two entry points are sufficient to prevent future regressions in this exact transient-close path without requiring heavier end-to-end scenarios.User-visible / Behavior Changes
callGateway/executeGatewayRequestWithScopesare more robust against a specific transient handshake pattern: a pre-hello WebSocket close with code1000and an empty reason.openclaw cron statusandopenclaw cron list --jsonshould no longer fail sporadically withgateway closed (1000):on otherwise healthy gateways.Security Impact (required)
Yes/No) NoYes/No) NoYes/No) NoYes/No) NoYes/No) NoIf any
Yes, explain risk + mitigation: N/ARepro + Verification
Environment
openclaw cron)Steps
openclaw cron statusoropenclaw cron list --jsonrepeatedly in an environment where the WebSocket stack occasionally emits a transient pre-hello1000+ empty-reason close.gateway closed (1000):even though the Control UI and direct WebSocket loopback remain healthy.Expected
1000+ empty-reason close, wait for the reconnect, and succeed normally when the gateway is healthy.Actual
gateway closed (1000):on the first handshake, despite the gateway being healthy and reconnects succeeding under the hood.Evidence
1000close pattern and confirm the corrected behavior)Human Verification (required)
What you personally verified (not just CI), and how:
corepack pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/call.test.ts src/gateway/client.test.tson the patched branch.openclaw cron statusandopenclaw cron list --jsonno longer fail withgateway closed (1000):when the gateway is healthy.pre-hello + code 1000 + empty reasonpath is suppressed; other close codes or non-empty reasons still fail the call as before.pnpm build && pnpm check && pnpm testfor the entire monorepo; this PR focuses on gateway client + call behavior and is covered by targeted gateway tests.Review Conversations
Compatibility / Migration
Yes/No) YesYes/No) NoYes/No) NoIf yes, exact upgrade steps: N/A
Failure Recovery (if this breaks)
fbe47e1874(or the PR merge commit) to restore the priorexecuteGatewayRequestWithScopesandGatewayClient.connectbehavior.src/gateway/call.tssrc/gateway/client.tssrc/gateway/call.test.tssrc/gateway/client.test.ts1000close starts occurring in the wild, CLI callers might now see a timeout instead of an immediategateway closed (1000):error for that specific pattern.Risks and Mitigations
1000+ empty-reason close as transient, causing the CLI to wait for a reconnect that never meaningfully succeeds and eventually hit a timeout instead of failing immediately.pre-hello + code 1000 + empty reason. Any close with a reason, any non-1000code, or any close afterhello-okstill fails the call as before. Timeouts are unchanged, so a genuinely broken gateway still propagates as a failure rather than silently succeeding.GatewayClient.connectstring match for the transientgateway closed (1000):error.src/gateway/client.test.tsexercise this behavior explicitly; if the error message shape changes in the future, the tests should fail and force an update to the matching logic or a move to more structured error signaling.