perf(proxy): WebSocket keepalive ping prevents middlebox idle drops #442
Merged
icebear0828 merged 3 commits into dev on May 5, 2026
Conversation
Pooled WSes were silently RST'd by upstream LB / NAT / firewall idle timeouts after ~30-60s with no traffic, surfacing as code=1006 on the next turn. Each drop forced a fresh WebSocket against a different backend instance, losing the prompt cache prefix and dragging hit rates back to 5-9%. Send a ws-level ping frame every 25s (configurable, 0 disables) so the middlebox NAT/connection-tracker keeps the mapping alive. Real-traffic verification: single pooled WS sustained 22+ consecutive turns at 88-94% hit, vs the prior pattern of single-use WS dying after one request.
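A minimal sketch of the keepalive wiring this describes, in TypeScript against the `ws` package. The `PersistentWs`, `markDead`, and `pingIntervalMs` names come from this PR's summary; everything else (field layout, constructor shape) is illustrative rather than the repo's actual code.

```ts
import WebSocket from "ws";

// Illustrative keepalive wiring for one pooled connection.
export class PersistentWs {
  private pingTimer?: NodeJS.Timeout;

  constructor(
    private readonly ws: WebSocket,
    private readonly pingIntervalMs = 25_000, // default 25s; 0 disables keepalive
  ) {
    if (this.pingIntervalMs > 0) {
      this.pingTimer = setInterval(
        () => this.sendKeepalivePing(),
        this.pingIntervalMs,
      );
    }
  }

  private sendKeepalivePing(): void {
    // Never ping a socket that is no longer OPEN; teardown is markDead's job.
    if (this.ws.readyState !== WebSocket.OPEN) return;
    this.ws.ping(); // WS-level ping frame keeps the NAT / LB mapping alive
  }

  markDead(): void {
    // Stop the keepalive loop when the connection is evicted from the pool.
    if (this.pingTimer) clearInterval(this.pingTimer);
  }
}
```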
Address review feedback on PR #442:
- sendKeepalivePing returns early when this.busy is true. The active stream's data frames already keep the upstream LB / NAT idle timers fresh, so emitting a ping during streaming would be redundant bandwidth on chatty sessions (a standalone sketch of this guard order follows the list).
- Strengthen the error-swallow test to assert pingCount=1 after the swallowed throw — a bare not.toThrow() would have missed a regression that crashes the interval loop after one bad ping.
- Add a regression test for the busy-skip behavior.
- Inline comment on WsLike.ping() to flag the narrowed signature versus real ws.WebSocket.ping(data?, mask?, callback?).
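A hedged sketch of the revised guard order (busy check, OPEN check, swallowed ping error). The `busy` flag mirrors the field named above; the standalone-function form is illustrative, since the real code is a method on the pool wrapper.

```ts
import WebSocket from "ws";

// Illustrative ordering of the guards described in the bullets above.
export function sendKeepalivePing(ws: WebSocket, busy: boolean): void {
  // 1. Skip while busy: the active stream's data frames already keep the
  //    upstream LB / NAT idle timers fresh, so a ping would be redundant.
  if (busy) return;
  // 2. Skip sockets that are no longer OPEN.
  if (ws.readyState !== WebSocket.OPEN) return;
  // 3. Swallow ping errors so one bad ping cannot crash the interval loop
  //    (this is what the strengthened pingCount assertion guards against).
  try {
    ws.ping(); // the pool only relies on the narrowed ping(): void shape
  } catch {
    /* intentionally swallowed */
  }
}
```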
Address review feedback #2 on PR #442: detect silently-broken pooled connections proactively instead of waiting for the next real request to discover them via code=1006.

Track lastActivityAt — updated by ANY pong or data message from the peer (both prove the connection is alive). On each ping tick, if now - lastActivityAt > livenessTimeoutMs, markDead the WS. The default threshold is 2.5x pingIntervalMs (~62.5s with the default 25s ping): it tolerates one missed pong (network blip) but evicts before a third ping would tick, at which point the connection is almost certainly dead and reusing it would cost a real-request cache miss. A counter-based "missed pings" alternative was rejected: it would false-positive on healthy streaming sessions where the server sends data but no separate pong, dragging working connections offline. (A sketch follows the test list below.)

Verified end-to-end from device a (Mac mini, 192.168.10.2) → proxy 192.168.10.6:8080 → chatgpt.com via a 10-turn pinned-session load script with a 70s idle gap between turns 5 and 6. Turn 6 stayed on the same pooled WS as turns 1-5 and hit 99.6% cache (matching pre-gap turn 5), with zero liveness-timeout markDead events — the keepalive pings carried the connection across the LB idle window unharmed.

The WsLike interface gains `on("pong", listener)`. Real ws.WebSocket already emits "pong" per RFC 6455 §5.5.3.

Tests added (6 new):
- liveness > marks dead when peer stays silent past timeout
- liveness > pong resets the clock
- liveness > data message resets the clock
- liveness > livenessTimeoutMs=0 disables
- liveness > default multiple keeps healthy WS alive across many cycles
- ping > skips while busy (active stream keeps LB alive)
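A sketch of the liveness tracking under the same caveats: `lastActivityAt`, `livenessTimeoutMs`, `markDead`, and the `pong`/`message` listeners come from the description above, while the surrounding class shape and the `terminate()` call stand in for the real pool eviction path (the busy-skip guard from the previous sketch is omitted for brevity).

```ts
import WebSocket from "ws";

// Illustrative liveness tracking for a pooled connection.
export class PersistentWs {
  private lastActivityAt = Date.now();
  private readonly livenessTimeoutMs: number;
  private pingTimer?: NodeJS.Timeout;

  constructor(
    private readonly ws: WebSocket,
    private readonly pingIntervalMs = 25_000,
    livenessTimeoutMs?: number,
  ) {
    // Default 2.5x the ping interval (~62.5s at 25s pings): tolerates one
    // missed pong, evicts before a third ping tick.
    this.livenessTimeoutMs = livenessTimeoutMs ?? this.pingIntervalMs * 2.5;

    // ANY pong or data message from the peer proves the connection is alive.
    ws.on("pong", () => { this.lastActivityAt = Date.now(); });
    ws.on("message", () => { this.lastActivityAt = Date.now(); });

    if (this.pingIntervalMs > 0) {
      this.pingTimer = setInterval(() => this.onPingTick(), this.pingIntervalMs);
    }
  }

  private onPingTick(): void {
    // livenessTimeoutMs = 0 disables the proactive check entirely.
    const silentFor = Date.now() - this.lastActivityAt;
    if (this.livenessTimeoutMs > 0 && silentFor > this.livenessTimeoutMs) {
      this.markDead(); // peer silent past the threshold: evict proactively
      return;
    }
    if (this.ws.readyState === WebSocket.OPEN) this.ws.ping();
  }

  markDead(): void {
    if (this.pingTimer) clearInterval(this.pingTimer);
    this.ws.terminate(); // stand-in for the pool's real eviction path
  }
}
```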
icebear0828 added a commit that referenced this pull request on May 5, 2026
The soak check measures `now - dev_HEAD_timestamp >= 24h`, which means every new merge into dev resets the clock. Under any non-trivial merge cadence, dev never satisfies the soak gate and master stagnates: PRs #437/#438/#439/#440/#442 all stacked on dev for a week with no promotion.

Add a `force_skip_soak` boolean input to workflow_dispatch (default false). The scheduled cron remains untouched and continues to enforce the 24h rule. Only manual triggers can bypass, and only when the operator explicitly sets the input to true — intended for sync-back / merge commits whose content has actually been on dev long enough but whose HEAD timestamp is misleadingly fresh.

Test plan: YAML syntax verified via js-yaml. Functional verification will be the next manual workflow_dispatch run with the input set.

Co-authored-by: icebear0828 <icebear0828@users.noreply.github.com>
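The gate itself lives in the workflow YAML; purely as an illustration of the intended decision logic, here is a hedged TypeScript sketch with hypothetical names (`SoakInput`, `shouldPromote`) that are not from the repo.

```ts
// Hypothetical sketch of the promotion decision; the real check is a
// condition in the GitHub Actions workflow, not application code.
interface SoakInput {
  nowMs: number;
  devHeadTimestampMs: number;               // commit time of dev HEAD
  trigger: "schedule" | "workflow_dispatch";
  forceSkipSoak: boolean;                   // new input, defaults to false
}

const SOAK_MS = 24 * 60 * 60 * 1000;

export function shouldPromote(i: SoakInput): boolean {
  // Only a manual run with the input explicitly set may bypass the soak
  // window; scheduled runs always enforce the 24h rule.
  if (i.trigger === "workflow_dispatch" && i.forceSkipSoak) return true;
  return i.nowMs - i.devHeadTimestampMs >= SOAK_MS;
}
```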
Summary
- Pooled WebSockets were silently dropped by upstream LB / NAT / firewall idle timeouts after ~30-60s of no traffic (surfacing as `code=1006`), forcing a fresh connection against a different backend on the next turn and dragging the prompt cache hit rate back to 5-9%.
- Send a WS-level ping frame every 25s (configurable via `pingIntervalMs`, set 0 to disable) so the middlebox keeps the NAT / connection-tracker mapping alive.
- The ping interval is started in the `PersistentWs` constructor and cleared in `markDead`. The ping is skipped when the underlying WS is no longer OPEN.

Test plan
- `npx vitest run tests/unit/proxy/ws-pool.test.ts` (34/34 pass, includes 5 new keepalive cases)
- `npx vitest run tests/integration/ws-pool-reuse.test.ts` (6/6 pass against a real local `ws.Server`)
- `npx vitest run tests/unit/proxy/` (232/232 pass)
- `npx tsc --noEmit` clean
- `localhost:8080`: a single pool slot held 22+ consecutive turns at 88-94% hit rate; only one `code=1006` observed in a 3000-line log buffer (down from every few minutes before).

Notes
The `WsLike` interface picks up a `ping(): void` requirement. The real `ws` package `WebSocket` already implements it; only the in-tree `MockWs` in `tests/unit/proxy/ws-pool.test.ts` needed updating.
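For orientation, a sketch of the `WsLike` shape these notes imply, assuming only the members mentioned across this PR (`ping()`, the `"pong"` listener from the liveness commit); the repo's actual interface may list more or fewer members.

```ts
// Approximate WsLike shape implied by this PR; not the repo's exact code.
export interface WsLike {
  readonly readyState: number;
  // Narrowed on purpose: real ws.WebSocket.ping(data?, mask?, callback?)
  // still satisfies a zero-argument call.
  ping(): void;
  // "pong" added by the liveness commit; the "message" overload is assumed
  // to exist already for normal streaming traffic.
  on(event: "pong" | "message", listener: (...args: unknown[]) => void): void;
}
```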