
perf(proxy): WebSocket keepalive ping prevents middlebox idle drops#442

Merged
icebear0828 merged 3 commits into dev from ws-pool-keepalive-ping on May 5, 2026

Conversation

@icebear0828
Owner

Summary

  • Pooled WebSocket connections were silently RST'd by upstream LB / NAT / firewall idle timeouts (code=1006), forcing a fresh connection against a different backend on the next turn and dragging the prompt cache hit rate back to 5-9%.
  • Send a ws-level ping frame every 25s (configurable via pingIntervalMs, set 0 to disable) so the middlebox keeps the NAT / connection-tracker mapping alive.
  • Timer is started in the PersistentWs constructor and cleared in markDead; the ping is skipped when the underlying WS is no longer OPEN (see the sketch below).
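
A minimal sketch of that lifecycle, assuming the internals look roughly like this — only the names PersistentWs, pingIntervalMs, and markDead come from this PR; the WsLike shape, the OPEN constant, and the error swallowing are simplifications:

```ts
const OPEN = 1; // WebSocket readyState "OPEN" per RFC 6455 / the ws package

interface WsLike {
  readyState: number;
  ping(): void;
}

class PersistentWs {
  private pingTimer?: ReturnType<typeof setInterval>;

  constructor(
    private readonly ws: WsLike,
    private readonly pingIntervalMs = 25_000, // sits below common 30-60s idle timeouts
  ) {
    if (this.pingIntervalMs > 0) {
      // Started in the constructor; 0 disables the keepalive entirely.
      this.pingTimer = setInterval(() => this.sendKeepalivePing(), this.pingIntervalMs);
    }
  }

  private sendKeepalivePing(): void {
    if (this.ws.readyState !== OPEN) return; // skip once the underlying WS is no longer OPEN
    try {
      this.ws.ping(); // ws-level ping frame keeps the NAT / conntrack mapping alive
    } catch {
      // a single bad ping must not kill the interval loop
    }
  }

  markDead(): void {
    if (this.pingTimer) clearInterval(this.pingTimer); // cleared in markDead
    this.pingTimer = undefined;
  }
}
```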

Test plan

  • npx vitest run tests/unit/proxy/ws-pool.test.ts (34/34 pass, includes 5 new keepalive cases)
  • npx vitest run tests/integration/ws-pool-reuse.test.ts (6/6 pass against real local ws.Server)
  • npx vitest run tests/unit/proxy/ (232/232 pass)
  • npx tsc --noEmit clean
  • Real upstream (chatgpt.com) over localhost:8080: a single pool slot held 22+ consecutive turns at an 88-94% hit rate; only one code=1006 observed in a 3000-line log buffer (down from one every few minutes before).

Notes

  • WsLike interface picks up a ping(): void requirement (sketch after these notes). The real ws package's WebSocket already implements it; only the in-tree MockWs in tests/unit/proxy/ws-pool.test.ts needed updating.
  • 25s default sits below the typical 30-60s idle timeout of common LBs (AWS ALB default 60s) and NAT trackers.
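
The test-double change is then mechanical. A sketch of what the updated MockWs could look like — the MockWs and pingCount names appear in this PR, the rest of the shape is assumed:

```ts
// Hypothetical shape of the in-tree test double after the WsLike change.
class MockWs {
  readyState = 1; // OPEN
  pingCount = 0;

  // Narrowed vs real ws.WebSocket.ping(data?, mask?, callback?): no args needed here.
  ping(): void {
    this.pingCount++;
  }

  send(_data: string): void {}
  on(_event: string, _listener: (...args: unknown[]) => void): void {}
}
```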

perf(proxy): WebSocket keepalive ping prevents middlebox idle drops

Pooled WSes were silently RST'd by upstream LB / NAT / firewall idle
timeouts after ~30-60s with no traffic, surfacing as code=1006 on the
next turn. Each drop forced a fresh WebSocket against a different
backend instance, losing the prompt cache prefix and dragging hit rates
back to 5-9%.

Send a ws-level ping frame every 25s (configurable, 0 disables) so the
middlebox NAT/connection-tracker keeps the mapping alive. Real-traffic
verification: single pooled WS sustained 22+ consecutive turns at
88-94% hit, vs the prior pattern of single-use WS dying after one
request.

perf(proxy): skip keepalive ping while WS is busy + harden tests

Address review feedback on PR #442:

- sendKeepalivePing returns early when this.busy is true (sketch after
  this list). The active stream's data frames already keep the upstream
  LB / NAT idle timers fresh, so emitting a ping during streaming would
  just waste bandwidth on chatty sessions.
- Strengthen the error-swallow test to assert pingCount=1 after the
  swallowed throw — a bare not.toThrow() would have missed a
  regression that crashes the interval loop after one bad ping.
- Add a regression test for the busy-skip behavior.
- Inline comment on WsLike.ping() to flag the narrowed signature
  versus real ws.WebSocket.ping(data?, mask?, callback?).
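
A focused sketch of that guard, assuming a `busy` flag that is true while
a streamed request is in flight (the flag and method names come from the
bullets above; everything else is illustrative):

```ts
class PersistentWs {
  busy = false; // true while a real request is streaming on this slot

  constructor(private readonly ws: { readyState: number; ping(): void }) {}

  sendKeepalivePing(): void {
    if (this.busy) return;                // the active stream's frames already reset idle timers
    if (this.ws.readyState !== 1) return; // 1 === OPEN
    try {
      this.ws.ping();
    } catch {
      // swallowed; the hardened test asserts MockWs.pingCount afterwards
      // instead of relying on a bare not.toThrow()
    }
  }
}
```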

perf(proxy): add WS liveness check (pong/message tracking)

Address review feedback #2 on PR #442: detect silently-broken pooled
connections proactively instead of waiting for the next real request
to discover them via code=1006.

Track lastActivityAt — updated by ANY pong or data message from the
peer (both prove the connection is alive). On each ping tick, if
now - lastActivityAt > livenessTimeoutMs, markDead the WS. Default
threshold is 2.5x pingIntervalMs (~62.5s with default 25s ping):
tolerates one missed pong (network blip) but evicts before a third
would tick, at which point the connection is almost certainly dead
and reusing it would cost a real-request cache miss.
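
A sketch of that rule — lastActivityAt, livenessTimeoutMs, and the 2.5x
default are taken from this commit, the rest is assumed; the explicit
`now` parameters only exist to make the sketch easy to exercise in tests:

```ts
class PersistentWs {
  private lastActivityAt = Date.now();

  constructor(
    readonly pingIntervalMs = 25_000,
    // Default: 2.5x the ping interval (~62.5s); 0 disables the liveness check.
    readonly livenessTimeoutMs = 2.5 * pingIntervalMs,
  ) {}

  /** Any pong or data message from the peer proves the connection is alive. */
  noteActivity(now = Date.now()): void {
    this.lastActivityAt = now;
  }

  /** Runs on every ping tick, right next to sendKeepalivePing. */
  checkLiveness(now = Date.now()): void {
    if (this.livenessTimeoutMs === 0) return; // disabled
    if (now - this.lastActivityAt > this.livenessTimeoutMs) {
      this.markDead(); // silently broken: evict before the next real request pays a cache miss
    }
  }

  markDead(): void {
    // clear timers and drop this slot from the pool (elided)
  }
}
```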

Counter-based "missed pings" alternative was rejected: it would
false-positive on healthy streaming sessions where the server sends
data but no separate pong, dragging working connections offline.

Verified end-to-end from device A (Mac mini, 192.168.10.2) →
proxy 192.168.10.6:8080 → chatgpt.com via a 10-turn pinned-session
load script with a 70s idle gap between turns 5 and 6. Turn 6 stayed
on the same pooled WS as turns 1-5 and hit 99.6% cache (matching
pre-gap turn 5), with zero liveness-timeout markDead events — the
keepalive pings carried the connection across the LB idle window
unharmed.

WsLike interface gains `on("pong", listener)`. Real ws.WebSocket
already emits "pong" per RFC 6455 §5.5.3.

Tests added (6 new; a sketch of the pattern follows the list):
- liveness > marks dead when peer stays silent past timeout
- liveness > pong resets the clock
- liveness > data message resets the clock
- liveness > livenessTimeoutMs=0 disables
- liveness > default multiple keeps healthy WS alive across many cycles
- ping > skips while busy (active stream keeps LB alive)
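
For illustration, the "pong resets the clock" case can be driven directly
through the explicit `now` parameters from the liveness sketch above,
without fake timers. The import path and test body are hypothetical, not
the real ws-pool.test.ts:

```ts
import { describe, expect, it, vi } from "vitest";
import { PersistentWs } from "../../src/proxy/ws-pool"; // hypothetical path

describe("liveness", () => {
  it("pong resets the clock", () => {
    const ws = new PersistentWs(25_000); // livenessTimeoutMs defaults to 62.5s
    const markDead = vi.spyOn(ws, "markDead");

    ws.noteActivity(0);        // baseline activity at t=0
    ws.checkLiveness(50_000);  // 50s of silence: under budget, keep
    ws.noteActivity(50_000);   // pong arrives, clock resets
    ws.checkLiveness(100_000); // only 50s since the pong: still healthy
    expect(markDead).not.toHaveBeenCalled();

    ws.checkLiveness(113_000); // 63s of silence since the pong: evict
    expect(markDead).toHaveBeenCalledOnce();
  });
});
```
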
@icebear0828 icebear0828 merged commit 47d8dae into dev May 5, 2026
1 check passed
@icebear0828 icebear0828 deleted the ws-pool-keepalive-ping branch May 5, 2026 06:26
icebear0828 added a commit that referenced this pull request May 5, 2026
The soak check measures `now - dev_HEAD_timestamp >= 24h`, which means
every new merge into dev resets the clock. Under any non-trivial merge
cadence, dev never satisfies the soak gate and master stagnates: PRs
#437/#438/#439/#440/#442 all stacked on dev for a week with no
promotion.

Add a `force_skip_soak` boolean input to workflow_dispatch (default
false). Schedule cron remains untouched and continues to enforce the
24h rule. Only manual triggers can bypass, and only when the operator
explicitly sets the input to true — intended for sync-back / merge
commits whose content has actually been on dev long enough but whose
HEAD timestamp is misleadingly fresh.

Test plan: yaml syntax verified via js-yaml. Functional verification
will be the next manual workflow_dispatch run with the input set.

Co-authored-by: icebear0828 <icebear0828@users.noreply.github.com>