Skip to content

fix(gateway): tolerate transient pre-hello closes in CLI calls#54475

Closed
ruanrrn wants to merge 5 commits into
openclaw:mainfrom
ruanrrn:fix/cron-cli-prehello-close
Closed

fix(gateway): tolerate transient pre-hello closes in CLI calls#54475
ruanrrn wants to merge 5 commits into
openclaw:mainfrom
ruanrrn:fix/cron-cli-prehello-close

Conversation

@ruanrrn

@ruanrrn ruanrrn commented Mar 25, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: One-shot gateway calls from the CLI (for example openclaw cron status / openclaw cron list --json) can fail with gateway closed (1000): when the first WebSocket connection closes with code 1000 and an empty reason before hello-ok, even though the client would immediately reconnect and complete the handshake successfully.
  • Why it matters: This surfaces as a flaky but user-visible failure path for CLI commands while the gateway itself is healthy, which breaks automation and undermines trust in the CLI.
  • What changed: executeGatewayRequestWithScopes(...) now tracks whether hello-ok has been seen and ignores only a pre-hello close with code 1000 and an empty reason; GatewayClient.connect treats that same transient connect error as non-fatal so the existing reconnect logic can complete; added focused tests covering both the one-shot call path and the client reconnect behavior.
  • What did NOT change (scope boundary): No changes to auth/secret handling, no changes to other close codes or non-empty close reasons, no changes to broader reconnect policy, and no config or CLI surface changes.

(AI-assisted: drafted and iterated with an OpenClaw workspace agent; changes and tests were reviewed and verified by a human.)

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes: N/A (fixing a locally diagnosed CLI gateway handshake failure that did not yet have a public issue)
  • Related: internal debugging of transient pre-hello 1000 closes on CLI gateway calls
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

  • Root cause: The one-shot gateway request wrapper (executeGatewayRequestWithScopes) treated any onClose event as terminal, even if it happened before hello-ok. In some environments, the underlying WebSocket stack can emit a transient 1000 close with an empty reason during the initial handshake, while GatewayClient reconnects and later completes hello successfully.
  • Missing detection / guardrail: There was no distinction between a transient pre-hello close that can be recovered by reconnect vs. a final close after a completed hello. The wrapper did not track whether hello-ok had been seen and did not special-case this transient 1000 path.
  • Prior context: This shows up as flaky failures on CLI commands like openclaw cron status/list with gateway closed (1000): even though the gateway is reachable and the Control UI can talk to it.
  • Why this regressed now: This is a behavior interaction between the CLI wrapper and gateway/WebSocket behavior that surfaced more often in certain deployments; it is not tied to a single recent refactor, but to the lack of pre-hello vs post-hello distinction in the close handling.
  • If unknown, what was ruled out: Verified that the gateway itself was healthy (Control UI worked, direct WebSocket loopback was fine) and that the failure was in the CLI handshake path, not gateway liveness.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file:
    • src/gateway/call.test.ts: new test ensuring callGateway({ method: "health" }) waits through a transient pre-hello 1000 close and resolves successfully once hello-ok arrives.
    • src/gateway/client.test.ts: new test ensuring GatewayClient suppresses the spurious gateway connect failed: ... (1000) log + onConnectError callback when a pre-hello 1000 close is followed by a successful reconnect and hello-ok.
  • Scenario the test should lock in:
    • A CLI one-shot gateway call that sees a transient pre-hello 1000 + empty-reason close should not fail early; it should wait for the reconnect and complete normally.
    • The same transient close should not be logged as a connect-failed error or reported to onConnectError when the reconnect ultimately succeeds.
  • Why this is the smallest reliable guardrail:
    • The bug is localized to the gateway call wrapper and GatewayClient connect path. Unit-level tests against these two entry points are sufficient to prevent future regressions in this exact transient-close path without requiring heavier end-to-end scenarios.
  • Existing test that already covers this (if any): None.
  • If no new test is added, why not: N/A (tests added).

User-visible / Behavior Changes

  • CLI calls that go through callGateway/executeGatewayRequestWithScopes are more robust against a specific transient handshake pattern: a pre-hello WebSocket close with code 1000 and an empty reason.
  • Commands like openclaw cron status and openclaw cron list --json should no longer fail sporadically with gateway closed (1000): on otherwise healthy gateways.
  • No change to error behavior for other close codes, non-empty close reasons, auth failures, or timeouts.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No

If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux (container/VM environment)
  • Runtime/container: Node 22, pnpm-managed workspace
  • Model/provider: N/A (gateway + CLI path only)
  • Integration/channel (if any): CLI (openclaw cron)
  • Relevant config (redacted): Gateway reachable on local loopback, CLI configured to talk to the local gateway.

Steps

  1. Start a local gateway and configure the CLI against it.
  2. On a build without this fix, run openclaw cron status or openclaw cron list --json repeatedly in an environment where the WebSocket stack occasionally emits a transient pre-hello 1000 + empty-reason close.
  3. Observe intermittent failures with gateway closed (1000): even though the Control UI and direct WebSocket loopback remain healthy.
  4. Apply this change and re-run the same CLI commands.

Expected

  • CLI commands should tolerate the transient pre-hello 1000 + empty-reason close, wait for the reconnect, and succeed normally when the gateway is healthy.

Actual

  • Before: CLI calls can fail immediately with gateway closed (1000): on the first handshake, despite the gateway being healthy and reconnects succeeding under the hood.
  • After: The same commands complete successfully; the transient close is no longer reported as a fatal error, and the new unit tests cover this behavior.

Evidence

  • Failing test/log before + passing after (simulated via new unit tests that reproduce the pre-hello 1000 close pattern and confirm the corrected behavior)
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • corepack pnpm exec vitest run --config vitest.gateway.config.ts src/gateway/call.test.ts src/gateway/client.test.ts on the patched branch.
    • Manual verification on a local install that previously exhibited this bug: openclaw cron status and openclaw cron list --json no longer fail with gateway closed (1000): when the gateway is healthy.
  • Edge cases checked:
    • Only the specific pre-hello + code 1000 + empty reason path is suppressed; other close codes or non-empty reasons still fail the call as before.
    • Timeouts still behave as before; this change does not alter timeout behavior.
  • What you did not verify:
    • Full pnpm build && pnpm check && pnpm test for the entire monorepo; this PR focuses on gateway client + call behavior and is covered by targeted gateway tests.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment. (No prior bot conversations for this PR; marked proactively.)

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No

If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • Revert commit fbe47e1874 (or the PR merge commit) to restore the prior executeGatewayRequestWithScopes and GatewayClient.connect behavior.
  • Files/config to restore:
    • src/gateway/call.ts
    • src/gateway/client.ts
    • src/gateway/call.test.ts
    • src/gateway/client.test.ts
  • Known bad symptoms reviewers should watch for:
    • If a real, unrecoverable pre-hello 1000 close starts occurring in the wild, CLI callers might now see a timeout instead of an immediate gateway closed (1000): error for that specific pattern.

Risks and Mitigations

  • Risk:
    • Misclassifying a non-recoverable pre-hello 1000 + empty-reason close as transient, causing the CLI to wait for a reconnect that never meaningfully succeeds and eventually hit a timeout instead of failing immediately.
    • Mitigation: The suppression is narrowly scoped to pre-hello + code 1000 + empty reason. Any close with a reason, any non-1000 code, or any close after hello-ok still fails the call as before. Timeouts are unchanged, so a genuinely broken gateway still propagates as a failure rather than silently succeeding.
  • Risk:
    • Future changes to error message formatting could break the GatewayClient.connect string match for the transient gateway closed (1000): error.
    • Mitigation: The new unit tests in src/gateway/client.test.ts exercise this behavior explicitly; if the error message shape changes in the future, the tests should fail and force an update to the matching logic or a move to more structured error signaling.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: S labels Mar 25, 2026
@greptile-apps

greptile-apps Bot commented Mar 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a flaky CLI failure where one-shot gateway calls (callGateway) could immediately throw gateway closed (1000): even on a healthy gateway, because the first WebSocket connection occasionally emits a transient code-1000 close with an empty reason before the hello-ok handshake completes.

Changes:

  • call.tsexecuteGatewayRequestWithScopes now tracks a helloOkSeen flag; a pre-hello close with code 1000 and an empty reason is silently ignored so the existing reconnect logic can complete the handshake.
  • client.tsGatewayClient.sendConnect suppresses the onConnectError callback and the logError call for the same transient pre-hello close, preventing spurious "gateway connect failed" log lines.
  • Two focused unit tests cover both the one-shot wrapper path (call.test.ts) and the GatewayClient reconnect path (client.test.ts).

Notes:

  • The fix is correctly scoped: only pre-hello + code 1000 + empty reason is suppressed; all other close codes, non-empty reasons, and post-hello closes behave exactly as before.
  • The transient-close detection in client.ts relies on regex-matching err.message against the error template that produces it. This is functionally correct today but is a maintenance fragility: if the error message template on line 296 of client.ts is ever changed, the detection would silently stop working. A structured error type or tagged sentinel would be more robust long-term.

Confidence Score: 4/5

  • Safe to merge; the fix is well-scoped and well-tested with one minor maintainability concern in the detection approach.
  • Logic is correct: the helloOkSeen guard in call.ts and the isTransientPreHelloClose guard in client.ts are both correctly implemented and tested. The only non-blocking concern is that client.ts detects the transient close by regex-matching the human-readable error message string rather than using a structured sentinel, making it fragile to future message-copy changes. This does not affect correctness today and the existing tests will catch a regression if the message format changes. No security impact, no auth changes, no new network calls.
  • No files require special attention beyond the regex-matching fragility noted in src/gateway/client.ts lines 523–528.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/client.ts
Line: 523-528

Comment:
**String-based error detection is fragile**

The transient close detection relies on matching `err.message` against a hardcoded regex that must stay in sync with the error message template two hundred lines away:

```ts
// line 296 – error construction
this.flushPendingErrors(new Error(`gateway closed (${code}): ${reasonText}`));

// line 526 – detection regex
/^gateway closed \(1000\):\s*$/.test(err.message)
```

If the template on line 296 is ever changed (e.g. adding a suffix, changing the separator, or including `reasonText` differently for the empty-string case), the regex would silently stop matching and CLI callers would revert to the pre-fix behaviour — failing immediately on transient pre-hello closes — without any test failure in unrelated paths. The new tests in `client.test.ts` do guard this specific message, but only for the test scenario as written.

A more robust alternative would be to use a structured sentinel (e.g. a typed error subclass `GatewayPreHelloCloseError`, or a tagged `{ isPreHelloClose: true }` field) rather than pattern-matching on a human-readable message string. This would be immune to message-copy changes and would not require the regex to be kept in sync.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(gateway): tolerate transient pre-hel..." | Re-trigger Greptile

Comment thread src/gateway/client.ts
@ruanrrn ruanrrn force-pushed the fix/cron-cli-prehello-close branch from 012cb0d to af48d50 Compare March 25, 2026 16:21
@ruanrrn ruanrrn force-pushed the fix/cron-cli-prehello-close branch from af48d50 to 29473d1 Compare April 27, 2026 13:43
@clawsweeper

clawsweeper Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Keep this PR open: current main and the latest release still reject the narrow pre-hello clean-close path, and this branch remains the canonical implementation candidate, but it is not merge-ready because it conflicts with main and still lacks inspectable real behavior proof.

Canonical path: Close this PR as superseded by #85253.

So I’m closing this here and keeping the remaining discussion on #85253.

Review details

Best possible solution:

Close this PR as superseded by #85253.

Do we have a high-confidence way to reproduce the issue?

Yes for the narrow event sequence by source inspection: current main rejects an unsettled onClose before hello-ok, and the PR adds simulated close-then-hello tests. I did not establish a live environment that naturally emits the transient close.

Is this the best way to solve the issue?

Yes if refreshed: the PR keeps the fix in the gateway call/client layer, scopes suppression to pre-hello code 1000 with an empty reason, and preserves fatal behavior for other close cases. It still needs conflict resolution, changed-surface validation, and inspectable real CLI proof before merge.

Security review:

Security review cleared: The diff is limited to gateway close/error handling, tests, and changelog text, with no dependency, workflow, permission, package, install, or secret-handling changes found.

What I checked:

  • linked superseding PR: [Fix] Surface gateway connect assembly failures #85253 ([Fix] Surface gateway connect assembly failures) is merged at 2026-05-22T07:40:07Z.
  • cluster evidence: the durable review links that PR in the work cluster or recommended risk path.
  • no human follow-up: live comments and timeline hydrated by apply contain no non-automation activity after the ClawSweeper review.

Likely related people:

  • steipete: Recent history shows gateway client/call handshake and reconnect work in the affected files, including early connect challenge and websocket auth logging changes. (role: recent gateway handshake contributor; confidence: high; commits: 3243c9b5b064, bf40baaa4dfa, 56b071400418; files: src/gateway/client.ts, src/gateway/client.test.ts, src/gateway/call.ts)
  • samzong: Merged PR [Fix] Surface gateway connect assembly failures #85253 changed the same call/client handshake path for connect assembly failures and now conflicts with this branch's surface. (role: recent adjacent gateway connect-failure contributor; confidence: high; commits: 0e47815e6ec9, 318a3019a7e2, f817d59fb2e5; files: src/gateway/call.ts, src/gateway/client.ts, src/gateway/call.test.ts)
  • vincentkoc: The PR timeline records a narrow repair pushed to this canonical branch, and local history shows recent gateway-adjacent source work in the same files. (role: branch repair author and recent area contributor; confidence: medium; commits: 2a2eb2759ade, e089d02714c2, 935bd6de7fcb; files: src/gateway/call.ts, src/gateway/client.ts, src/gateway/call.test.ts)
  • Takhoffman: History shows bounded unanswered gateway client request work touching the same one-shot call/client request-settlement surface. (role: adjacent gateway call wrapper contributor; confidence: medium; commits: 5fc43ff0ecd8; files: src/gateway/call.ts, src/gateway/call.test.ts, src/gateway/client.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 6d9b3887ea26.

@vincentkoc

Copy link
Copy Markdown
Member

ProjectClownfish pushed a narrow repair to this branch so the original contributor path can stay canonical.

Source PR: #54475
Validation: pnpm check:changed
Contributor credit is preserved in the branch history and PR context.

@ruanrrn ruanrrn force-pushed the fix/cron-cli-prehello-close branch from 6f5aea4 to e089d02 Compare April 30, 2026 01:42
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 19, 2026
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 19, 2026
@clawsweeper clawsweeper Bot added P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels May 19, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@clawsweeper

clawsweeper Bot commented May 24, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper applied the proposed close for this PR.

@clawsweeper clawsweeper Bot closed this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clawsweeper Tracked by ClawSweeper automation gateway Gateway runtime merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. P2 Normal backlog priority with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: M status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants