Skip to content

fix(gateway): add WebSocket ping keepalive to prevent idle connection drops#56668

Closed
joelagnel wants to merge 1 commit intoopenclaw:mainfrom
joelagnel:fix/ws-ping-keepalive
Closed

fix(gateway): add WebSocket ping keepalive to prevent idle connection drops#56668
joelagnel wants to merge 1 commit intoopenclaw:mainfrom
joelagnel:fix/ws-ping-keepalive

Conversation

@joelagnel
Copy link
Copy Markdown

Summary

  • Problem: WebSocket webchat connections silently drop (close code 1006) during long tool call chains because the server sends zero WS-level ping frames. Network intermediaries treat the connection as idle and kill it.
  • Why it matters: Users lose live message delivery during tool execution; buffered messages only flood in when a new connection opens on next user action.
  • What changed: Added a 25s server-side WebSocket ping interval per connection in the gateway WS handler, cleared on close.
  • What did NOT change (scope boundary): No client-side changes, no new config options, no protocol changes. The existing application-level tick broadcast is untouched.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause / Regression History (if applicable)

  • Root cause: The gateway WebSocketServer never calls socket.ping(). The existing keepalive is an application-level JSON "tick" event broadcast (every 30s), which is a regular data frame — not a WS protocol ping. Some intermediaries only honor WS-level ping/pong for idle detection. Additionally, the tick broadcast uses dropIfSlow: true, so slow clients can miss ticks entirely, leaving the connection truly idle.
  • Missing detection / guardrail: No WS ping/pong was ever implemented server-side.
  • Prior context: The TICK_INTERVAL_MS application-level keepalive was the only mechanism; WS-level pings were never added.
  • Why this regressed now: Not a regression — this was never implemented. Became visible with longer tool call chains that create 30-60s idle windows.
  • If unknown, what was ruled out: N/A

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • If no new test is added, why not: The fix is a single setInterval + clearInterval — the 20 existing ws-connection tests pass and verify connection lifecycle. The actual bug (intermediary killing idle connections) is a network-level behavior not reproducible in unit tests.

User-visible / Behavior Changes

WebSocket connections now send protocol-level ping frames every 25 seconds, preventing network intermediaries from dropping idle connections during long tool call chains.

Diagram (if applicable)

Before:
[tool call starts] -> [30-60s idle on WS] -> [intermediary kills connection] -> [user sends msg] -> [new conn] -> [message flood]

After:
[tool call starts] -> [ping every 25s keeps conn alive] -> [messages delivered live]

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (WS ping is a control frame on an existing connection)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 22+
  • Integration/channel (if any): Webchat / Control UI dashboard

Steps

  1. Open dashboard webchat
  2. Send a message that triggers multiple tool calls (30-60s)
  3. Observe: connection stays alive, messages delivered live

Expected

  • WebSocket connection stays alive during tool execution

Actual

  • (Before fix) Connection drops silently with code 1006

Evidence

  • Failing test/log before + passing after
  • 20/20 existing ws-connection tests pass
  • Gateway logs from issue show [ws] webchat disconnected code=1006 — 12 disconnects in 17 minutes

Human Verification (required)

  • Verified scenarios: All 20 ws-connection tests pass; format check passes
  • Edge cases checked: close() clears the ping interval; socket.ping() is wrapped in try/catch for safety
  • What you did not verify: Live network intermediary behavior (requires deployed gateway + real Tailscale/proxy setup)

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: Minimal additional network traffic (one 2-byte ping frame per connection every 25s)
    • Mitigation: Standard WS best practice per RFC 6455 §5.5.2; negligible overhead

Made with Cursor

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: XS labels Mar 28, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 28, 2026

Greptile Summary

Adds a 25-second server-side WebSocket ping interval (RFC 6455 protocol-level ping, not an application data frame) to prevent network intermediaries from dropping idle connections during long tool call chains. The interval is started on each connection and correctly cleared via clearInterval in the existing close() function, which is invoked on all teardown paths.

Confidence Score: 5/5

Safe to merge — the change is minimal, well-scoped, and follows correct cleanup patterns.

A single setInterval/clearInterval pair with correct teardown on all close paths. No logic errors or P1 issues found. The one P2 finding is an optional comment improvement on the empty catch block.

No files require special attention.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/server/ws-connection.ts
Line: 164-166

Comment:
**Silent ping error discard**

The empty `catch {}` swallows all errors from `socket.ping()` without any log or diagnostic signal. If the socket enters an unexpected error state before `socket.once("error", ...)` fires, repeated silent failures would be invisible in logs. A brief inline comment distinguishing the expected close-race case from unexpected errors would help during incident triage.

```suggestion
      try {
        socket.ping();
      } catch {
        // Socket may be in CLOSING state; clearInterval in close() handles cleanup
      }
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix(gateway): add WebSocket ping keepali..." | Re-trigger Greptile

@EronFan
Copy link
Copy Markdown
Contributor

EronFan commented Mar 29, 2026

PR Review: WebSocket ping keepalive

Approve. Clean fix for a real-world gateway stability issue.

Key points:

  • The existing "tick" broadcast is an application-level JSON frame, not a WS protocol ping — intermediaries that only honor WS-level ping/pong for idle detection were killing connections during long tool-call chains
  • Adding a 25s server-side socket.ping() is standard WS best practice (RFC 6455 §5.5.2) and the overhead is negligible (~2 bytes/connection/25s)
  • The clearInterval on close prevents resource leaks
  • The try/catch around socket.ping() is good defensive coding

Note: The 20 existing ws-connection tests passing is reassuring. The actual intermediary behavior (killing idle connections) can't be unit-tested but the fix is straightforward and correct.

LGTM.


Reviewed as part of OpenClaw community contribution tracking

@joelagnel joelagnel force-pushed the fix/ws-ping-keepalive branch from 78cfac0 to 98c3c93 Compare March 29, 2026 15:24
@joelagnel
Copy link
Copy Markdown
Author

No changes during recent force push (I just rebased/restored my fixes branch which is triggering CI again). Patch was LGTM'd already and safe to merge.

@joelagnel
Copy link
Copy Markdown
Author

@greptileai do another review

@openclaw-barnacle
Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Apr 30, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 30, 2026

Codex review: needs changes before merge.

What this changes:

The PR adds a 25-second per-connection socket.ping() interval in src/gateway/server/ws-connection.ts and clears that interval during connection teardown.

Required change before merge:

A narrow automated repair can add the missing changelog entry; the final merge versus #74783 still needs maintainer judgment after that cleanup.

Security review:

Security review cleared: The diff only sends WebSocket control frames on existing gateway sockets and does not change dependencies, CI, secrets, permissions, package resolution, install scripts, or command execution paths.

Review findings:

  • [P3] Add the required changelog entry — src/gateway/server/ws-connection.ts:160
Review details

Best possible solution:

Keep exactly one gateway-owned WebSocket protocol ping path. Merge this focused keepalive after adding the changelog entry if maintainers want the minimal idle-drop fix, or land #74783 as canonical if they want ping/pong health telemetry and a shared maintenance-loop owner.

Do we have a high-confidence way to reproduce the issue?

Yes. The linked bug gives concrete WebChat long-tool-call steps and code 1006 logs, and current-main source confirms the gateway only has JSON tick liveness with no WebSocket protocol ping. The network-intermediary kill path still needs live deployment validation before merge.

Is this the best way to solve the issue?

Unclear. This per-connection timer is the narrowest maintainable fix for idle connection drops and matches the ws dependency contract, but #74783 may be the better final shape if maintainers want pong RTT/staleness telemetry and one shared maintenance-loop owner.

Full review comments:

  • [P3] Add the required changelog entry — src/gateway/server/ws-connection.ts:160
    This is a user-facing gateway fix, but the diff only changes ws-connection.ts. Repo policy requires user-facing fix PRs to add a single-line CHANGELOG.md entry under Unreleased Fixes with contributor credit, so the idle-drop behavior change would otherwise be omitted from release notes.
    Confidence: 0.93

Overall correctness: patch is correct
Overall confidence: 0.88

Acceptance criteria:

  • git diff --check
  • pnpm exec oxfmt --check --threads=1 CHANGELOG.md

What I checked:

  • Current main lacks gateway protocol ping: A focused search for socket.ping / .ping( in src/gateway/server and src/gateway/client.ts returned no matches on current main, and attachGatewayWsConnectionHandler currently sets up connection state and teardown without a ping timer. (src/gateway/server/ws-connection.ts:185, e0cc374b07c1)
  • Existing keepalive is application-level tick data: startGatewayMaintenanceTimers broadcasts a JSON tick event on TICK_INTERVAL_MS, and protocol docs describe tick as the liveness event; this is distinct from a WebSocket control-frame ping. (src/gateway/server-maintenance.ts:64, e0cc374b07c1)
  • PR diff is narrowly scoped: The PR file list shows one source-file hunk adding setInterval(() => socket.ping(), 25_000) and clearInterval(pingTimer) in the existing close path, with no config, protocol, dependency, CI, or client changes. (src/gateway/server/ws-connection.ts:160, 98c3c939fcb8)
  • ws dependency supports server ping loops: The pinned dependency is ws 8.20.0; upstream 8.20.0 exposes WebSocket#ping() and its README documents server-side interval ping loops for broken connection detection. (package.json:1645, e0cc374b07c1)
  • Required changelog entry is missing: CHANGELOG.md has no Unreleased gateway WebSocket ping / idle-drop entry for this user-facing fix; exact searches only found unrelated Mattermost and generic 1006 entries. (CHANGELOG.md:11, e0cc374b07c1)
  • Related broader PR overlaps ping ownership: Open Add gateway health connection telemetry #74783 adds gateway connection health telemetry using WebSocket protocol ping/pong from a maintenance-loop owner and explicitly says it should be the single ping owner if it lands first. (f264e9514de0)

Likely related people:

  • steipete: Current-main blame and commit metadata in this checkout point to Peter for the visible gateway WebSocket connection and maintenance-loop files around the affected behavior. (role: recent maintainer / likely follow-up owner; confidence: medium; commits: 84c85734a813, e0cc374b07c1; files: src/gateway/server/ws-connection.ts, src/gateway/server-maintenance.ts)
  • trialanderrorstudios: Open Add gateway health connection telemetry #74783 works on the same gateway connection-health surface, adds a WebSocket ping/pong path, and explicitly coordinates ownership with this PR. (role: adjacent owner; confidence: medium; commits: f264e9514de0; files: src/gateway/server/connection-health.ts, src/gateway/server-maintenance.ts, src/gateway/server/ws-connection.ts)

Remaining risk / open question:

  • The live intermediary behavior still needs deployment-level validation; source review confirms the missing protocol ping, but not that 25 seconds is the ideal cadence for every proxy/Tailscale path.
  • Open Add gateway health connection telemetry #74783 overlaps the same ping ownership surface, so merging both branches without adjustment could create duplicate protocol ping loops.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e0cc374b07c1.

@BunsDev
Copy link
Copy Markdown
Member

BunsDev commented May 2, 2026

Closing as superseded by #76111, which merged the Gateway WebSocket protocol ping keepalive on the authenticated connection lifecycle and also fixed the related Control UI Stop-after-reconnect path for #70991. Thanks for the focused report and patch; the canonical merged fix is #76111.

@BunsDev BunsDev closed this May 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: WebSocket webchat connections drop with code 1006 during long tool calls — no server-side ping/keepalive

3 participants