
feat(proxy): WebSocket connection pool fixes prompt-cache hit-rate jitter#440

Merged
icebear0828 merged 4 commits into dev from feat/ws-connection-pool
May 4, 2026

Conversation

@icebear0828
Owner

Summary

Fixes the upstream prompt-cache hit rate jittering wildly between 5% and 99%. Root cause: codex-proxy opened a `new WebSocket(url)` for every upstream WS request, and OpenAI's WS gateway load-balances by hashing the connection ID → the same logical session was thrown onto random backends, each of which had cached a prefix of a different length, so cached_tokens kept bouncing between discrete "checkpoints" such as 1920 / 2432 / 24448 / 40320 / 47488.

Following the strategy of the real codex CLI (WebsocketSession.connection at core/src/client.rs:802), this introduces a per-(entryId, conversationId) persistent WS connection pool: consecutive turns of the same conversation on the same account all travel over the same physical WS, so the upstream LB pins to one backend and the prompt cache hits in steady state.
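The root cause can be sketched with a toy model: a load balancer that hashes the connection ID picks a fresh backend for every new connection, while a pinned (pooled) connection keeps landing on the same one. The hash function, backend count, and IDs below are illustrative assumptions, not OpenAI's actual gateway behavior.

```typescript
// Toy model: an LB that hashes the connection ID over N backends (assumed
// behavior for illustration; not OpenAI's actual hash or topology).
const BACKENDS = 4;

// FNV-1a-style string hash as a stand-in for the gateway's LB hash.
function backendFor(connId: string): number {
  let h = 2166136261;
  for (const ch of connId) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % BACKENDS;
}

// Old behavior: a fresh connection (new ID) per turn → turns can land on
// different backends, each of which has cached a different prefix length.
const perTurnIds = ["conn-t1", "conn-t2", "conn-t3", "conn-t4", "conn-t5"];
console.log("per-turn:", perTurnIds.map(backendFor));

// Pooled behavior: one persistent connection for the whole session → every
// turn reaches the same backend, so its prefix cache keeps growing.
const pinned = "conn-session";
console.log("pooled:  ", perTurnIds.map(() => backendFor(pinned)));
```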

Architecture

  • src/proxy/ws-pool.ts (new): PersistentWs + WsConnectionPool + WsReusedConnectionError
    • Pool key: `${entryId}:${conversationId}`, covering both explicit and implicit chain continuation
    • Strictly one in-flight request per WS: when the pooled connection is busy, a one-shot connection is opened on the side (no queuing, which avoids deadlocks)
    • No idle TTL: connections stay open and are cleaned up only on natural death, on reaching max_age_ms (55 min, closing ahead of the server's 60-min hard limit), or on account state changes (evictByEntryId)
    • Reuse-failure semantics: close before any response → WsReusedConnectionError → one fallback to a one-shot WS; close mid-stream → surfaced to the client (no retry, since partial data was already delivered)
    • Account state hooks: account-pool.ts cascades evictByEntryId on markRateLimited / markStatus(non-active) / removeAccount / updateToken(refresh), so an old WS carrying a stale access_token is never reused
    • Process exit: the SIGTERM/SIGINT hooks additionally call wsPool.shutdown()
  • src/proxy/ws-transport.ts: extracts openOneShotWs (the old path); createWebSocketResponse gains an optional poolCtx? parameter. With a ctx it tries pool.acquire first, otherwise it falls back to one-shot. Any pool/factory failure also falls back to one-shot and never leaks to the caller
  • src/proxy/codex-api.ts / src/routes/shared/proxy-handler.ts: pass poolCtx through; buildPoolCtx() builds the ctx on WS paths such as implicit resume and explicit prev_resp_id
  • Config ws_pool: { enabled: true (default), max_age_ms: 3300000, max_per_account: 8 }: on by default (this is a bugfix); rollback is enabled: false plus a restart
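The acquire/bypass/evict behavior described above can be sketched as follows. All names here (`WsPoolSketch`, `PooledConn`, the `Acquire` union) are hypothetical stand-ins for the PR's PersistentWs/WsConnectionPool; this is a minimal model of the semantics, not the actual implementation.

```typescript
// Minimal pool sketch: key by `${entryId}:${conversationId}`, strictly one
// in-flight request per socket, bypass (not queue) when busy, max-age cutoff.
type Acquire =
  | { kind: "reuse"; conn: PooledConn }
  | { kind: "new"; conn: PooledConn }
  | { kind: "bypass"; reason: "busy" };

interface PooledConn {
  id: string;
  busy: boolean;
  openedAt: number;
}

class WsPoolSketch {
  private conns = new Map<string, PooledConn>();
  constructor(private maxAgeMs = 55 * 60 * 1000) {}

  acquire(entryId: string, conversationId: string, now = Date.now()): Acquire {
    const key = `${entryId}:${conversationId}`;
    const existing = this.conns.get(key);
    if (existing) {
      // Busy socket: caller opens a one-shot WS on the side instead of queuing.
      if (existing.busy) return { kind: "bypass", reason: "busy" };
      if (now - existing.openedAt >= this.maxAgeMs) {
        // Past the 55-min cutoff: drop before the server's 60-min hard limit.
        this.conns.delete(key);
      } else {
        existing.busy = true;
        return { kind: "reuse", conn: existing };
      }
    }
    const conn: PooledConn = { id: key, busy: true, openedAt: now };
    this.conns.set(key, conn);
    return { kind: "new", conn };
  }

  release(conn: PooledConn): void {
    conn.busy = false;
  }

  // Account state change (token refresh, rate limit, removal): drop every
  // pooled connection belonging to that entry.
  evictByEntryId(entryId: string): void {
    for (const key of this.conns.keys()) {
      if (key.startsWith(`${entryId}:`)) this.conns.delete(key);
    }
  }
}

// Turns of the same conversation share one connection; a second request
// issued while the first is in flight bypasses instead of queuing.
const pool = new WsPoolSketch();
const first = pool.acquire("acct1", "convA");
console.log(first.kind);                          // new
console.log(pool.acquire("acct1", "convA").kind); // bypass
if (first.kind === "new") pool.release(first.conn);
console.log(pool.acquire("acct1", "convA").kind); // reuse
```

Bypassing instead of queuing means a busy pooled socket never serializes unrelated requests behind each other: the occasional one-shot connection costs at most a cache miss on that turn, but can never deadlock.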

Observability

The request entry log gains a ws= field:

  • ws=reuse:abc12345 — pool hit, connection reused
  • ws=new:def67890 — pool miss, new connection opened
  • ws=bypass(busy|cap|dead|disabled|no_key|factory_error) — bypassed to a one-shot connection
  • ws=retry-after-stale-reuse:abc12345 — reuse failed, single automatic retry

Combined with the existing rid/conv/key/prev/resume/hit fields, pool behavior and hit-rate attribution can be observed directly.
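A minimal sketch of how such a field could be rendered, assuming a discriminated-union decision type (the `WsDispatchDecision` shape below is inferred from the log formats above, not taken from the actual code):

```typescript
// The four dispatch outcomes, modeled as a discriminated union.
type WsDispatchDecision =
  | { kind: "reuse"; connId: string }
  | { kind: "new"; connId: string }
  | { kind: "bypass"; reason: "busy" | "cap" | "dead" | "disabled" | "no_key" | "factory_error" }
  | { kind: "retry-after-stale-reuse"; connId: string };

// Render one decision into the ws= log field.
function wsField(d: WsDispatchDecision): string {
  switch (d.kind) {
    case "reuse":
    case "new":
    case "retry-after-stale-reuse":
      return `ws=${d.kind}:${d.connId}`;
    case "bypass":
      return `ws=bypass(${d.reason})`;
  }
}

console.log(wsField({ kind: "reuse", connId: "abc12345" })); // ws=reuse:abc12345
console.log(wsField({ kind: "bypass", reason: "busy" }));    // ws=bypass(busy)
```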

Test Plan

  • npm test — 1702 passed (+1 skipped), zero regressions
  • npx tsc --noEmit — clean
  • npm run build — vite + tsc OK
  • tests/unit/proxy/ws-pool.test.ts — 26 unit tests covering the full PersistentWs + WsConnectionPool lifecycle (acquire/release/idle/dead/cap/abort/server limit/upgrade headers/...)
  • tests/integration/ws-pool-reuse.test.ts — spins up a local ws.Server to verify:
    • 5 turns in the same conversation → the server sees only 1 connection ✓
    • different conversations → 1 connection each ✓
    • server-initiated close → immediate eviction from the pool + reconnect on the next acquire ✓
    • enabled: false → degrades to one-shot ✓
    • evictByEntryId → pool emptied + rebuilt afterwards ✓
    • no poolCtx passed → behavior 100% identical to today ✓
  • the existing ws-transport*.test.ts files are untouched and all pass (backward compatible)
  • end-to-end live hit-rate verification: restart dev and run a long session; target steady state ≥ 90% (old baseline: 5%~99% jitter)

Notes

icebear0828 added a commit that referenced this pull request May 4, 2026
PR #440 added the ws_pool config schema field but never actually read it — `getWsPool()`
always used DEFAULT_WS_POOL_CONFIG, so setting `ws_pool.enabled: false` had no
effect and the rollback story was broken.

Fix: startServer() calls setWsPoolConfig() right after loading the config, replacing
the singleton with the user's settings. The shutdown hook's dynamic import is also
switched to a static one (same-module reuse).

3 new tests covering the singleton wiring chain:
- enabled: false → acquire always bypasses (disabled), factory never called
- default getWsPool() works
- a later setWsPoolConfig call overrides an earlier one

Tests: 1705 passed (+1 skipped)
@icebear0828 icebear0828 force-pushed the feat/ws-connection-pool branch from bdfd173 to 881926a May 4, 2026 02:07
@icebear0828 icebear0828 merged commit f187423 into dev May 4, 2026
1 check passed
@icebear0828 icebear0828 deleted the feat/ws-connection-pool branch May 4, 2026 02:10
icebear0828 added a commit that referenced this pull request May 5, 2026
The soak check measures `now - dev_HEAD_timestamp >= 24h`, which means
every new merge into dev resets the clock. Under any non-trivial merge
cadence, dev never satisfies the soak gate and master stagnates: PRs
#437/#438/#439/#440/#442 all stacked on dev for a week with no
promotion.

Add a `force_skip_soak` boolean input to workflow_dispatch (default
false). Schedule cron remains untouched and continues to enforce the
24h rule. Only manual triggers can bypass, and only when the operator
explicitly sets the input to true — intended for sync-back / merge
commits whose content has actually been on dev long enough but whose
HEAD timestamp is misleadingly fresh.

Test plan: yaml syntax verified via js-yaml. Functional verification
will be the next manual workflow_dispatch run with the input set.

Co-authored-by: icebear0828 <icebear0828@users.noreply.github.com>
icebear0828 added a commit that referenced this pull request May 5, 2026
…tter (#440)

* feat(proxy): add WS connection pool module (no wiring yet)

Adds PersistentWs + WsConnectionPool + WsReusedConnectionError in preparation for
fixing the prompt-cache hit-rate jitter (5%~99%, bimodal) caused by the upstream
WS gateway routing by a hash of the connection ID. This commit is pure module
addition + 26 unit tests; nothing is wired up yet, so the regression risk is zero.

Design points:
- Pool key = `${entryId}:${conversationId}`, covering explicit + implicit chain continuation
- Strictly one in-flight request per WS (required by the codex protocol); when busy,
  acquire returns bypass and the caller takes the old path (no queuing, avoiding deadlocks)
- No idle TTL; max_age = 55 min leaves a 5-min buffer before the server's 60-min hard limit
- Dead connections / abort / account state changes cascade into cleanup
- A reuse failure before any response throws WsReusedConnectionError and the caller
  may retry once; a mid-stream failure goes through controller.error with no retry
  (the client has already received partial data)

The config schema gains ws_pool: { enabled: true, max_age_ms, max_per_account }
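The reuse-failure semantics in the design points can be sketched like this; apart from WsReusedConnectionError, the function names and shapes below are hypothetical:

```typescript
// A stale pooled socket that dies before any response triggers exactly one
// fallback to a one-shot connection; any other error (e.g. a mid-stream
// failure, where partial data already reached the client) is not retried.
class WsReusedConnectionError extends Error {}

async function sendWithPool(
  viaPool: () => Promise<string>,    // attempt over the pooled socket
  viaOneShot: () => Promise<string>, // fresh one-shot connection
): Promise<string> {
  try {
    return await viaPool();
  } catch (err) {
    if (err instanceof WsReusedConnectionError) {
      // Pre-response close on a reused socket: nothing reached the client
      // yet, so a single retry on a fresh connection is safe.
      return viaOneShot();
    }
    throw err; // mid-stream failures propagate; no retry
  }
}

// The stale pooled socket fails pre-response; the one-shot fallback succeeds.
sendWithPool(
  async () => { throw new WsReusedConnectionError("closed before response"); },
  async () => "ok via one-shot",
).then((r) => console.log(r)); // ok via one-shot
```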

* feat(proxy): wire WS connection pool through ws-transport + proxy-handler

Wires ws-pool into the main request chain.

ws-transport.ts:
- Extracts openOneShotWs, preserving the original one-shot semantics (backward
  compatible, zero changes for existing callers)
- createWebSocketResponse gains a poolCtx? parameter; with a ctx it tries pool.acquire
  first → on a hit, reuses PersistentWs.send; on WsReusedConnectionError, falls back
  once to one-shot
- Pool-internal failures (factory throwing) also fall back to one-shot, never leaking
  to the caller

codex-api.ts:
- createResponse / createResponseViaWebSocket pass poolCtx through to createWebSocketResponse
- The HTTP path is completely unaffected

proxy-handler.ts:
- buildPoolCtx() derives the poolKey = `${entryId}:${chainConversationId}` from
  useWebSocket + chainConversationId; it only takes effect on WS paths
- Both the main flow and handleNonStreaming's empty-response retry use the same builder

account-pool.ts:
- markStatus(non-active) / markRateLimited / removeAccount / updateToken (on refresh
  completion) cascade into evictByEntryId, closing all pooled WS for that account.
  Rationale: after a refresh, the access_token carried by the old WS is invalid; under
  the other state changes the account itself is unusable, so keeping the pool only wastes resources
- Uses a dynamic import to isolate ws-pool, so account-pool unit tests don't have to
  pull in the proxy layer

src/index.ts: the SIGINT/SIGTERM shutdown hooks additionally call wsPool.shutdown()
for a graceful close

Integration tests: tests/integration/ws-pool-reuse.test.ts spins up a local ws.Server to verify
- 5 turns in the same conversation → the server sees only 1 connection
- different conversations → 1 connection each
- server-initiated close → immediate eviction, new connection on the next acquire
- enabled: false → degrades to one-shot (a new connection per turn)
- evictByEntryId → pool emptied + rebuilt afterwards
- no poolCtx passed → behavior fully identical to today

Tests: 1702 passed (+1 skipped), zero regressions

* feat(proxy): observability for WS pool decisions + CHANGELOG

ws-transport: adds a WsDispatchDecision type + a WsPoolContext.onDecision callback;
the four decisions (reuse / new / bypass:<reason> / retry-after-stale-reuse) are
emitted to the caller exactly once, at dispatch time.

proxy-handler: buildPoolCtx installs an onDecision listener and emits one log line
per WS request, e.g. `[fmt] Account E | rid=R | ws=reuse:abc`, making pool hit
rate and jitter attribution directly observable.

CHANGELOG: adds a full entry under Unreleased → Fixed covering the problem, the
cause, the config, and the rollback.

* fix(proxy): wire ws_pool config into singleton (self-review of #440)

PR #440 added the ws_pool config schema field but never actually read it — `getWsPool()`
always used DEFAULT_WS_POOL_CONFIG, so setting `ws_pool.enabled: false` had no
effect and the rollback story was broken.

Fix: startServer() calls setWsPoolConfig() right after loading the config, replacing
the singleton with the user's settings. The shutdown hook's dynamic import is also
switched to a static one (same-module reuse).

3 new tests covering the singleton wiring chain:
- enabled: false → acquire always bypasses (disabled), factory never called
- default getWsPool() works
- a later setWsPoolConfig call overrides an earlier one

Tests: 1705 passed (+1 skipped)

---------

Co-authored-by: icebear0828 <icebear0828@users.noreply.github.com>
hangox added a commit to hangox/codex-proxy that referenced this pull request May 7, 2026
Missed while merging upstream icebear0828#440 (WebSocket connection pool): on the
streaming path, the empty-response account-switch retry that calls
retryEmptyResponseRequest did not pass buildPoolCtx, so createResponse after the
account switch bypassed the WS pool and the prompt-cache jitter persisted for
streaming clients. The non-streaming path passes it correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>