
perf(proxy): WebSocket keepalive ping prevents middlebox idle drops#442

Merged
icebear0828 merged 3 commits into dev from ws-pool-keepalive-ping on May 5, 2026

Conversation

@icebear0828
Owner

Summary

  • Pooled WebSocket connections were silently RST'd by upstream LB / NAT / firewall idle timeouts (code=1006), forcing a fresh connection against a different backend on the next turn and dragging the prompt cache hit rate back to 5-9%.
  • Send a ws-level ping frame every 25s (configurable via pingIntervalMs, set 0 to disable) so the middlebox keeps the NAT / connection-tracker mapping alive.
  • Timer is started in the PersistentWs constructor and cleared in markDead; the ping is skipped when the underlying WS is no longer OPEN (see the sketch below).
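
A minimal sketch of that lifecycle, assuming the internals look roughly like this — only the names PersistentWs, pingIntervalMs, and markDead come from this PR; the WsLike shape, the OPEN constant, and the error swallowing are simplifications:

```ts
const OPEN = 1; // WebSocket readyState "OPEN" per RFC 6455 / the ws package

interface WsLike {
  readyState: number;
  ping(): void;
}

class PersistentWs {
  private pingTimer?: ReturnType<typeof setInterval>;

  constructor(
    private readonly ws: WsLike,
    private readonly pingIntervalMs = 25_000, // sits below common 30-60s idle timeouts
  ) {
    if (this.pingIntervalMs > 0) {
      // Started in the constructor; 0 disables the keepalive entirely.
      this.pingTimer = setInterval(() => this.sendKeepalivePing(), this.pingIntervalMs);
    }
  }

  private sendKeepalivePing(): void {
    if (this.ws.readyState !== OPEN) return; // skip once the underlying WS is no longer OPEN
    try {
      this.ws.ping(); // ws-level ping frame keeps the NAT / conntrack mapping alive
    } catch {
      // a single bad ping must not kill the interval loop
    }
  }

  markDead(): void {
    if (this.pingTimer) clearInterval(this.pingTimer); // cleared in markDead
    this.pingTimer = undefined;
  }
}
```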

Test plan

  • npx vitest run tests/unit/proxy/ws-pool.test.ts (34/34 pass, includes 5 new keepalive cases)
  • npx vitest run tests/integration/ws-pool-reuse.test.ts (6/6 pass against real local ws.Server)
  • npx vitest run tests/unit/proxy/ (232/232 pass)
  • npx tsc --noEmit clean
  • Real upstream (chatgpt.com) over localhost:8080: a single pool slot held 22+ consecutive turns at an 88-94% hit rate; only one code=1006 observed in a 3000-line log buffer (down from one every few minutes before).

Notes

  • WsLike interface picks up a ping(): void requirement (sketch after these notes). The real ws package's WebSocket already implements it; only the in-tree MockWs in tests/unit/proxy/ws-pool.test.ts needed updating.
  • 25s default sits below the typical 30-60s idle timeout of common LBs (AWS ALB default 60s) and NAT trackers.
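
The test-double change is then mechanical. A sketch of what the updated MockWs could look like — the MockWs and pingCount names appear in this PR, the rest of the shape is assumed:

```ts
// Hypothetical shape of the in-tree test double after the WsLike change.
class MockWs {
  readyState = 1; // OPEN
  pingCount = 0;

  // Narrowed vs real ws.WebSocket.ping(data?, mask?, callback?): no args needed here.
  ping(): void {
    this.pingCount++;
  }

  send(_data: string): void {}
  on(_event: string, _listener: (...args: unknown[]) => void): void {}
}
```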

perf(proxy): WebSocket keepalive ping prevents middlebox idle drops

Pooled WSes were silently RST'd by upstream LB / NAT / firewall idle
timeouts after ~30-60s with no traffic, surfacing as code=1006 on the
next turn. Each drop forced a fresh WebSocket against a different
backend instance, losing the prompt cache prefix and dragging hit rates
back to 5-9%.

Send a ws-level ping frame every 25s (configurable, 0 disables) so the
middlebox NAT/connection-tracker keeps the mapping alive. Real-traffic
verification: single pooled WS sustained 22+ consecutive turns at
88-94% hit, vs the prior pattern of single-use WS dying after one
request.

perf(proxy): skip keepalive ping while WS is busy + harden tests

Address review feedback on PR #442:

- sendKeepalivePing returns early when this.busy is true (sketch after
  this list). The active stream's data frames already keep the upstream
  LB / NAT idle timers fresh, so emitting a ping during streaming would
  just waste bandwidth on chatty sessions.
- Strengthen the error-swallow test to assert pingCount=1 after the
  swallowed throw — a bare not.toThrow() would have missed a
  regression that crashes the interval loop after one bad ping.
- Add a regression test for the busy-skip behavior.
- Inline comment on WsLike.ping() to flag the narrowed signature
  versus real ws.WebSocket.ping(data?, mask?, callback?).
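
A focused sketch of that guard, assuming a `busy` flag that is true while
a streamed request is in flight (the flag and method names come from the
bullets above; everything else is illustrative):

```ts
class PersistentWs {
  busy = false; // true while a real request is streaming on this slot

  constructor(private readonly ws: { readyState: number; ping(): void }) {}

  sendKeepalivePing(): void {
    if (this.busy) return;                // the active stream's frames already reset idle timers
    if (this.ws.readyState !== 1) return; // 1 === OPEN
    try {
      this.ws.ping();
    } catch {
      // swallowed; the hardened test asserts MockWs.pingCount afterwards
      // instead of relying on a bare not.toThrow()
    }
  }
}
```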

perf(proxy): add WS liveness check (pong/message tracking)

Address review feedback #2 on PR #442: detect silently-broken pooled
connections proactively instead of waiting for the next real request
to discover them via code=1006.

Track lastActivityAt — updated by ANY pong or data message from the
peer (both prove the connection is alive). On each ping tick, if
now - lastActivityAt > livenessTimeoutMs, markDead the WS. Default
threshold is 2.5x pingIntervalMs (~62.5s with default 25s ping):
tolerates one missed pong (network blip) but evicts before a third
would tick, at which point the connection is almost certainly dead
and reusing it would cost a real-request cache miss.
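
A sketch of that rule — lastActivityAt, livenessTimeoutMs, and the 2.5x
default are taken from this commit, the rest is assumed; the explicit
`now` parameters only exist to make the sketch easy to exercise in tests:

```ts
class PersistentWs {
  private lastActivityAt = Date.now();

  constructor(
    readonly pingIntervalMs = 25_000,
    // Default: 2.5x the ping interval (~62.5s); 0 disables the liveness check.
    readonly livenessTimeoutMs = 2.5 * pingIntervalMs,
  ) {}

  /** Any pong or data message from the peer proves the connection is alive. */
  noteActivity(now = Date.now()): void {
    this.lastActivityAt = now;
  }

  /** Runs on every ping tick, right next to sendKeepalivePing. */
  checkLiveness(now = Date.now()): void {
    if (this.livenessTimeoutMs === 0) return; // disabled
    if (now - this.lastActivityAt > this.livenessTimeoutMs) {
      this.markDead(); // silently broken: evict before the next real request pays a cache miss
    }
  }

  markDead(): void {
    // clear timers and drop this slot from the pool (elided)
  }
}
```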

Counter-based "missed pings" alternative was rejected: it would
false-positive on healthy streaming sessions where the server sends
data but no separate pong, dragging working connections offline.

Verified end-to-end from device A (Mac mini, 192.168.10.2) →
proxy 192.168.10.6:8080 → chatgpt.com via a 10-turn pinned-session
load script with a 70s idle gap between turns 5 and 6. Turn 6 stayed
on the same pooled WS as turns 1-5 and hit 99.6% cache (matching
pre-gap turn 5), with zero liveness-timeout markDead events — the
keepalive pings carried the connection across the LB idle window
unharmed.

WsLike interface gains `on("pong", listener)`. Real ws.WebSocket
already emits "pong" per RFC 6455 §5.5.3.

Tests added (6 new; a sketch of the pattern follows the list):
- liveness > marks dead when peer stays silent past timeout
- liveness > pong resets the clock
- liveness > data message resets the clock
- liveness > livenessTimeoutMs=0 disables
- liveness > default multiple keeps healthy WS alive across many cycles
- ping > skips while busy (active stream keeps LB alive)
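
For illustration, the "pong resets the clock" case can be driven directly
through the explicit `now` parameters from the liveness sketch above,
without fake timers. The import path and test body are hypothetical, not
the real ws-pool.test.ts:

```ts
import { describe, expect, it, vi } from "vitest";
import { PersistentWs } from "../../src/proxy/ws-pool"; // hypothetical path

describe("liveness", () => {
  it("pong resets the clock", () => {
    const ws = new PersistentWs(25_000); // livenessTimeoutMs defaults to 62.5s
    const markDead = vi.spyOn(ws, "markDead");

    ws.noteActivity(0);        // baseline activity at t=0
    ws.checkLiveness(50_000);  // 50s of silence: under budget, keep
    ws.noteActivity(50_000);   // pong arrives, clock resets
    ws.checkLiveness(100_000); // only 50s since the pong: still healthy
    expect(markDead).not.toHaveBeenCalled();

    ws.checkLiveness(113_000); // 63s of silence since the pong: evict
    expect(markDead).toHaveBeenCalledOnce();
  });
});
```
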
@icebear0828 icebear0828 merged commit 47d8dae into dev May 5, 2026
1 check passed
@icebear0828 icebear0828 deleted the ws-pool-keepalive-ping branch May 5, 2026 06:26
icebear0828 added a commit that referenced this pull request May 5, 2026
The soak check measures `now - dev_HEAD_timestamp >= 24h`, which means
every new merge into dev resets the clock. Under any non-trivial merge
cadence, dev never satisfies the soak gate and master stagnates: PRs
#437/#438/#439/#440/#442 all stacked on dev for a week with no
promotion.

Add a `force_skip_soak` boolean input to workflow_dispatch (default
false). Schedule cron remains untouched and continues to enforce the
24h rule. Only manual triggers can bypass, and only when the operator
explicitly sets the input to true — intended for sync-back / merge
commits whose content has actually been on dev long enough but whose
HEAD timestamp is misleadingly fresh.

Test plan: yaml syntax verified via js-yaml. Functional verification
will be the next manual workflow_dispatch run with the input set.

Co-authored-by: icebear0828 <icebear0828@users.noreply.github.com>