fix(gateway): add WebSocket ping keepalive to prevent idle connection drops#56668
fix(gateway): add WebSocket ping keepalive to prevent idle connection drops#56668joelagnel wants to merge 1 commit intoopenclaw:mainfrom
Conversation
… drops Made-with: Cursor
Greptile SummaryAdds a 25-second server-side WebSocket ping interval (RFC 6455 protocol-level ping, not an application data frame) to prevent network intermediaries from dropping idle connections during long tool call chains. The interval is started on each connection and correctly cleared via Confidence Score: 5/5Safe to merge — the change is minimal, well-scoped, and follows correct cleanup patterns. A single setInterval/clearInterval pair with correct teardown on all close paths. No logic errors or P1 issues found. The one P2 finding is an optional comment improvement on the empty catch block. No files require special attention. Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/gateway/server/ws-connection.ts
Line: 164-166
Comment:
**Silent ping error discard**
The empty `catch {}` swallows all errors from `socket.ping()` without any log or diagnostic signal. If the socket enters an unexpected error state before `socket.once("error", ...)` fires, repeated silent failures would be invisible in logs. A brief inline comment distinguishing the expected close-race case from unexpected errors would help during incident triage.
```suggestion
try {
socket.ping();
} catch {
// Socket may be in CLOSING state; clearInterval in close() handles cleanup
}
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (2): Last reviewed commit: "fix(gateway): add WebSocket ping keepali..." | Re-trigger Greptile |
PR Review: WebSocket ping keepaliveApprove. Clean fix for a real-world gateway stability issue. Key points:
Note: The 20 existing ws-connection tests passing is reassuring. The actual intermediary behavior (killing idle connections) can't be unit-tested but the fix is straightforward and correct. LGTM. Reviewed as part of OpenClaw community contribution tracking |
78cfac0 to
98c3c93
Compare
|
No changes during recent force push (I just rebased/restored my fixes branch which is triggering CI again). Patch was LGTM'd already and safe to merge. |
|
@greptileai do another review |
|
This pull request has been automatically marked as stale due to inactivity. |
|
Codex review: needs changes before merge. What this changes: The PR adds a 25-second per-connection Required change before merge: A narrow automated repair can add the missing changelog entry; the final merge versus #74783 still needs maintainer judgment after that cleanup. Security review: Security review cleared: The diff only sends WebSocket control frames on existing gateway sockets and does not change dependencies, CI, secrets, permissions, package resolution, install scripts, or command execution paths. Review findings:
Review detailsBest possible solution: Keep exactly one gateway-owned WebSocket protocol ping path. Merge this focused keepalive after adding the changelog entry if maintainers want the minimal idle-drop fix, or land #74783 as canonical if they want ping/pong health telemetry and a shared maintenance-loop owner. Do we have a high-confidence way to reproduce the issue? Yes. The linked bug gives concrete WebChat long-tool-call steps and code 1006 logs, and current-main source confirms the gateway only has JSON Is this the best way to solve the issue? Unclear. This per-connection timer is the narrowest maintainable fix for idle connection drops and matches the Full review comments:
Overall correctness: patch is correct Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e0cc374b07c1. |
Summary
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause / Regression History (if applicable)
socket.ping(). The existing keepalive is an application-level JSON"tick"event broadcast (every 30s), which is a regular data frame — not a WS protocol ping. Some intermediaries only honor WS-level ping/pong for idle detection. Additionally, the tick broadcast usesdropIfSlow: true, so slow clients can miss ticks entirely, leaving the connection truly idle.TICK_INTERVAL_MSapplication-level keepalive was the only mechanism; WS-level pings were never added.Regression Test Plan (if applicable)
setInterval+clearInterval— the 20 existing ws-connection tests pass and verify connection lifecycle. The actual bug (intermediary killing idle connections) is a network-level behavior not reproducible in unit tests.User-visible / Behavior Changes
WebSocket connections now send protocol-level ping frames every 25 seconds, preventing network intermediaries from dropping idle connections during long tool call chains.
Diagram (if applicable)
Security Impact (required)
Repro + Verification
Environment
Steps
Expected
Actual
Evidence
[ws] webchat disconnected code=1006— 12 disconnects in 17 minutesHuman Verification (required)
close()clears the ping interval;socket.ping()is wrapped in try/catch for safetyReview Conversations
Compatibility / Migration
Risks and Mitigations
Made with Cursor