fix(gateway): add WebSocket ping keepalive to prevent idle connection drops #1601

Open
BingqingLyu wants to merge 1 commit into main from fork-pr-56668-fix-ws-ping-keepalive
Conversation

@BingqingLyu (Owner) commented Apr 27, 2026

Summary

  • Problem: WebSocket webchat connections silently drop (close code 1006) during long tool call chains because the server sends no WS-level ping frames. Network intermediaries treat the connection as idle and kill it.
  • Why it matters: Users lose live message delivery during tool execution; buffered messages arrive all at once only after the next user action opens a new connection.
  • What changed: Added a 25s server-side WebSocket ping interval per connection in the gateway WS handler, cleared on close.
  • What did NOT change (scope boundary): No client-side changes, no new config options, no protocol changes. The existing application-level tick broadcast is untouched.
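A minimal sketch of the behavior summarized above, not the actual diff. The names `PingableSocket`, `Timers`, and `startPingKeepalive` are hypothetical and introduced only for illustration; the 25s interval, the cleanup on close, and the try/catch around `ping()` are the parts taken from this PR. The timer functions are injectable (an assumption of this sketch) so the logic can be exercised without waiting on real time.

```typescript
// Hypothetical minimal socket shape; a real `ws` WebSocket also exposes
// ping() and emits a "close" event.
interface PingableSocket {
  ping(): void;
  on(event: "close", listener: () => void): void;
}

// Injectable timers (an assumption of this sketch, used for testing).
interface Timers {
  setInterval(fn: () => void, ms: number): unknown;
  clearInterval(handle: unknown): void;
}

const PING_INTERVAL_MS = 25_000; // 25s, as stated in the summary

function startPingKeepalive(
  socket: PingableSocket,
  intervalMs: number = PING_INTERVAL_MS,
  timers: Timers = {
    setInterval: (fn, ms) => setInterval(fn, ms),
    clearInterval: (h) => clearInterval(h as ReturnType<typeof setInterval>),
  },
): () => void {
  let stopped = false;

  const handle = timers.setInterval(() => {
    try {
      socket.ping(); // protocol-level ping control frame
    } catch {
      // The socket may already be closing; the close handler stops the timer.
    }
  }, intervalMs);

  // Idempotent cleanup so the interval is cleared exactly once.
  const stop = (): void => {
    if (stopped) return;
    stopped = true;
    timers.clearInterval(handle);
  };

  socket.on("close", stop);
  return stop;
}
```

Returning the `stop` function is a design choice of this sketch: it lets the caller clear the interval from other teardown paths as well as from the `close` event.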

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause / Regression History (if applicable)

  • Root cause: The gateway WebSocketServer never calls socket.ping(). The existing keepalive is an application-level JSON "tick" event broadcast (every 30s), which is a regular data frame — not a WS protocol ping. Some intermediaries only honor WS-level ping/pong for idle detection. Additionally, the tick broadcast uses dropIfSlow: true, so slow clients can miss ticks entirely, leaving the connection truly idle.
  • Missing detection / guardrail: No WS ping/pong was ever implemented server-side.
  • Prior context: The TICK_INTERVAL_MS application-level keepalive was the only mechanism; WS-level pings were never added.
  • Why this regressed now: Not a regression; this was never implemented. It became visible with longer tool call chains that create 30-60s idle windows.
  • If unknown, what was ruled out: N/A
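To make the distinction in the root cause concrete, the sketch below encodes the two kinds of frame at the byte level following the RFC 6455 base framing: a text data frame (opcode 0x1) carrying a JSON tick, and a ping control frame (opcode 0x9) with no payload. The `encodeServerFrame` helper and the JSON shape of the tick are assumptions made for illustration; the gateway's real event format is not shown in this PR.

```typescript
const OPCODE_TEXT = 0x1; // data frame
const OPCODE_PING = 0x9; // control frame

// Encodes a single unfragmented, unmasked (server-to-client) frame.
// Sketch only: handles payloads up to 65535 bytes.
function encodeServerFrame(opcode: number, payload: Uint8Array): Uint8Array {
  const isControl = (opcode & 0x8) !== 0;
  if (isControl && payload.length > 125) {
    throw new Error("control frames carry at most 125 payload bytes");
  }
  if (payload.length > 0xffff) {
    throw new Error("sketch only handles payloads up to 65535 bytes");
  }
  const headerLen = payload.length <= 125 ? 2 : 4;
  const frame = new Uint8Array(headerLen + payload.length);
  frame[0] = 0x80 | opcode; // FIN bit set, RSV bits clear, then the opcode
  if (payload.length <= 125) {
    frame[1] = payload.length; // MASK bit clear, 7-bit length
  } else {
    frame[1] = 126; // 126 signals a 16-bit extended length
    frame[2] = payload.length >> 8;
    frame[3] = payload.length & 0xff;
  }
  frame.set(payload, headerLen);
  return frame;
}

// Hypothetical tick payload; the actual JSON shape is an assumption.
const tickPayload = new TextEncoder().encode(
  JSON.stringify({ type: "event", event: "tick" }),
);

const tickFrame = encodeServerFrame(OPCODE_TEXT, tickPayload); // first byte 0x81
const pingFrame = encodeServerFrame(OPCODE_PING, new Uint8Array(0)); // bytes 0x89 0x00
```

The tick is an ordinary data frame whose first byte is 0x81, while the ping is a two-byte control frame starting with 0x89. That opcode is the only thing that marks it as a protocol-level ping.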

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • If no new test is added, why not: The fix is a single setInterval paired with a clearInterval. The 20 existing ws-connection tests pass and verify the connection lifecycle. The actual bug (an intermediary killing idle connections) is network-level behavior that unit tests cannot reproduce.

User-visible / Behavior Changes

WebSocket connections now send protocol-level ping frames every 25 seconds, preventing network intermediaries from dropping idle connections during long tool call chains.

Diagram (if applicable)

Before:
[tool call starts] -> [30-60s idle on WS] -> [intermediary kills connection] -> [user sends msg] -> [new conn] -> [message flood]

After:
[tool call starts] -> [ping every 25s keeps conn alive] -> [messages delivered live]

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (WS ping is a control frame on an existing connection)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 22+
  • Integration/channel (if any): Webchat / Control UI dashboard

Steps

  1. Open dashboard webchat
  2. Send a message that triggers multiple tool calls (30-60s)
  3. Observe: connection stays alive, messages delivered live

Expected

  • WebSocket connection stays alive during tool execution

Actual

  • (Before fix) Connection drops silently with code 1006

Evidence

  • Failing test/log before + passing after
  • 20/20 existing ws-connection tests pass
  • Gateway logs from issue show [ws] webchat disconnected code=1006 — 12 disconnects in 17 minutes

Human Verification (required)

  • Verified scenarios: All 20 ws-connection tests pass; format check passes
  • Edge cases checked: close() clears the ping interval; socket.ping() is wrapped in try/catch for safety
  • What you did not verify: Live network intermediary behavior (requires deployed gateway + real Tailscale/proxy setup)

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: Minimal additional network traffic (one 2-byte ping frame per connection every 25s)
    • Mitigation: Standard WS best practice per RFC 6455 §5.5.2; negligible overhead
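As a rough check on the overhead claim, the arithmetic can be sketched as below. The function name `pingBytesPerHour` is hypothetical. The figure counts only the 2-byte WebSocket ping frames sent by the server and leaves out the client's pong replies and TCP/IP headers, so it is a lower bound on actual traffic rather than a measurement.

```typescript
const PING_FRAME_BYTES = 2; // empty, unmasked ping frame (header only)
const PING_INTERVAL_SECONDS = 25; // interval added by this PR

// WebSocket-level bytes the server sends as pings per hour,
// for a given number of open connections.
function pingBytesPerHour(connections: number): number {
  const pingsPerHour = 3600 / PING_INTERVAL_SECONDS; // 144 pings per hour
  return connections * pingsPerHour * PING_FRAME_BYTES;
}

const perConnection = pingBytesPerHour(1); // 288 bytes per hour
const perThousand = pingBytesPerHour(1000); // 288000 bytes per hour
```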

Made with Cursor

Development

Successfully merging this pull request may close these issues.

[Bug]: WebSocket webchat connections drop with code 1006 during long tool calls — no server-side ping/keepalive

2 participants