feat(proxy): WebSocket connection pool fixes prompt-cache hit-rate jitter#440
Merged
icebear0828 merged 4 commits into dev on May 4, 2026
Conversation
icebear0828 added a commit that referenced this pull request on May 4, 2026
fix(proxy): wire ws_pool config into singleton (self-review of #440)

PR #440 added the config schema field ws_pool but never actually read it: `getWsPool()` always used DEFAULT_WS_POOL_CONFIG, so setting `ws_pool.enabled: false` had no effect and the rollback story was broken.

Fix: startServer() now calls setWsPoolConfig() immediately after loading cfg, replacing the singleton's config with the user's. Also switched the shutdown hook's dynamic import to a static one (the module is reused there anyway).

Added 3 tests covering the singleton wiring chain:
- enabled: false → acquire always bypasses (disabled), the factory is never called
- the default getWsPool() works
- a later setWsPoolConfig call overrides an earlier one

Tests: 1705 passed (+1 skipped)
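The singleton wiring described above can be sketched minimally as follows. `getWsPoolConfig` is a hypothetical accessor standing in for the config lookup inside the real `getWsPool()`; the config field names come from the schema in this PR.

```typescript
// Shape of the ws_pool config block added by this PR.
interface WsPoolConfig {
  enabled: boolean;
  max_age_ms: number;
  max_per_account: number;
}

const DEFAULT_WS_POOL_CONFIG: WsPoolConfig = {
  enabled: true,
  max_age_ms: 55 * 60 * 1000, // 55 min, ahead of the server's 60 min hard limit
  max_per_account: 8,
};

let current: WsPoolConfig = DEFAULT_WS_POOL_CONFIG;

// Before the fix, nothing ever called this setter with the user's config,
// so the default always won and `enabled: false` was silently ignored.
export function setWsPoolConfig(userCfg: Partial<WsPoolConfig>): void {
  current = { ...DEFAULT_WS_POOL_CONFIG, ...userCfg };
}

// Hypothetical accessor: the real lookup happens inside getWsPool().
export function getWsPoolConfig(): WsPoolConfig {
  return current;
}

// startServer() now applies the loaded user config right away:
setWsPoolConfig({ enabled: false });
```

Unspecified fields fall back to the defaults, so a user who only sets `enabled` keeps the stock `max_age_ms` and `max_per_account`.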
feat(proxy): add WS connection pool module (no wiring yet)

Add PersistentWs + WsConnectionPool + WsReusedConnectionError in preparation for fixing the prompt-cache hit-rate jitter (5%~99%, bimodal) caused by the upstream WS gateway routing by connection-ID hash. This commit is pure module addition + 26 unit tests, not wired in yet, zero regression risk.

Design notes:
- Pool key = `${entryId}:${conversationId}`, covering both explicit and implicit conversation chaining
- Strict serialization on a single WS (required by the codex protocol); when busy, acquire returns bypass so the caller takes the legacy path (no queueing, avoids deadlock)
- No idle TTL; max_age=55 min leaves a 5 min buffer before the server's 60 min hard limit
- Cascading cleanup on dead connections, aborts, and account status changes
- A reuse failure before any response throws WsReusedConnectionError, which the caller may retry once; a mid-stream failure goes through controller.error and is not retried (the client has already received partial data)

Config schema gains ws_pool: { enabled: true, max_age_ms, max_per_account }
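The acquire/bypass contract from the design notes can be sketched as below. The type and field names are illustrative (not the actual ws-pool.ts source), and the `max_per_account` cap and dead-connection eviction are omitted for brevity.

```typescript
// Three of the pool's dispatch outcomes: reuse a pooled WS, create a new
// one, or bypass to the legacy one-shot path.
type AcquireResult =
  | { kind: "reuse"; ws: PersistentWs }
  | { kind: "new"; ws: PersistentWs }
  | { kind: "bypass"; reason: "busy" | "disabled" };

class PersistentWs {
  busy = false;
  readonly createdAt = Date.now();
}

class WsConnectionPool {
  private pool = new Map<string, PersistentWs>();

  constructor(private enabled: boolean, private maxAgeMs: number) {}

  acquire(entryId: string, conversationId: string): AcquireResult {
    if (!this.enabled) return { kind: "bypass", reason: "disabled" };
    const key = `${entryId}:${conversationId}`;
    const existing = this.pool.get(key);
    if (existing) {
      // The codex protocol requires strict serialization per WS, so a busy
      // connection is never queued on; the caller bypasses to the legacy
      // one-shot path instead (no queueing, no deadlock).
      if (existing.busy) return { kind: "bypass", reason: "busy" };
      if (Date.now() - existing.createdAt > this.maxAgeMs) {
        // Past max_age: evict and fall through to a fresh connection.
        this.pool.delete(key);
      } else {
        return { kind: "reuse", ws: existing };
      }
    }
    const ws = new PersistentWs();
    this.pool.set(key, ws);
    return { kind: "new", ws };
  }
}
```

Pinning the pool key to `${entryId}:${conversationId}` is what keeps consecutive turns of one conversation on one physical connection, so the upstream LB hash stays stable.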
feat(proxy): wire WS connection pool through ws-transport + proxy-handler
Wire the ws-pool into the main request path.

ws-transport.ts:
- Extract openOneShotWs to preserve the original one-shot semantics (backwards compatible, zero changes for existing callers)
- createWebSocketResponse gains an optional poolCtx? parameter; with a ctx it first tries pool.acquire → on a hit it reuses PersistentWs.send; on WsReusedConnectionError it falls back to one-shot once
- A failure inside the pool itself (factory throws) also falls back to one-shot, never leaking into callers

codex-api.ts:
- createResponse / createResponseViaWebSocket pass poolCtx through to createWebSocketResponse
- The HTTP path is completely unaffected

proxy-handler.ts:
- buildPoolCtx() derives poolKey = `${entryId}:${chainConversationId}` from useWebSocket + chainConversationId; effective on the WS path only
- The main flow and handleNonStreaming's empty-response retry both use the same builder

account-pool.ts:
- markStatus(non-active) / markRateLimited / removeAccount / updateToken (after refresh) cascade into evictByEntryId, closing all pooled WS for that account. Rationale: after a refresh the old WS carries an invalidated access_token; under the other status changes the account itself is unusable, so keeping it pooled is pure waste
- ws-pool is isolated behind a dynamic import so account-pool unit tests don't have to pull in the proxy layer

src/index.ts: the SIGINT/SIGTERM shutdown hooks additionally call wsPool.shutdown() for a graceful close

Integration test: tests/integration/ws-pool-reuse.test.ts spins up a local ws.Server and verifies:
- 5 turns in the same conv → server sees exactly 1 connection
- different convs → 1 connection each
- server-initiated close → pool evicts immediately, next acquire creates a new one
- enabled: false → degrades to one-shot (new connection per turn)
- evictByEntryId → pool emptied + later rebuilt
- no poolCtx passed → behavior exactly matches today's

Tests: 1702 passed (+1 skipped), zero regressions
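The single-fallback contract described above (stale reuse → one retry via one-shot, everything else propagates) can be sketched like this. `sendWithPoolFallback` is a hypothetical helper name; in the real code this logic lives inside `createWebSocketResponse`.

```typescript
// Thrown when a pooled connection fails before any response bytes arrive.
class WsReusedConnectionError extends Error {}

// Try the pooled path once; on a pre-response reuse failure, retry exactly
// once via the legacy one-shot path. Any other error (including mid-stream
// failures, which surface via controller.error) is not retried here.
async function sendWithPoolFallback<T>(
  viaPool: () => Promise<T>,
  oneShot: () => Promise<T>,
): Promise<T> {
  try {
    return await viaPool();
  } catch (err) {
    if (err instanceof WsReusedConnectionError) {
      return oneShot(); // single retry, never back through the pool
    }
    throw err;
  }
}
```

The asymmetry is deliberate: before the first response byte the request is safely repeatable, but once the client has received partial data a retry would duplicate output.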
feat(proxy): observability for WS pool decisions + CHANGELOG

ws-transport: add a WsDispatchDecision type + a WsPoolContext.onDecision callback; the four decisions (reuse / new / bypass:<reason> / retry-after-stale-reuse) are emitted to the caller exactly once at dispatch time.

proxy-handler: buildPoolCtx attaches an onDecision listener and emits one log line per WS request, e.g. `[fmt] Account E | rid=R | ws=reuse:abc`, making pool hit rate and jitter attribution directly observable.

CHANGELOG: add a full entry under Unreleased → Fixed describing the problem, cause, config, and rollback.
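The decision-to-log mapping can be sketched as follows. The union covers the four decisions named in the commit; the exact field names of the real WsDispatchDecision and the `formatWsField` helper are assumptions for illustration.

```typescript
// One decision is emitted per WS request at dispatch time.
type WsDispatchDecision =
  | { kind: "reuse"; connId: string }
  | { kind: "new"; connId: string }
  | { kind: "bypass"; reason: string }
  | { kind: "retry-after-stale-reuse"; connId: string };

// Renders the `ws=` log field, e.g. `ws=reuse:abc12345` or `ws=bypass(busy)`.
function formatWsField(d: WsDispatchDecision): string {
  switch (d.kind) {
    case "reuse":
      return `ws=reuse:${d.connId}`;
    case "new":
      return `ws=new:${d.connId}`;
    case "bypass":
      return `ws=bypass(${d.reason})`;
    case "retry-after-stale-reuse":
      return `ws=retry-after-stale-reuse:${d.connId}`;
  }
}
```

buildPoolCtx would attach a listener around this formatter and prefix the account and request-id fields to produce the full log line.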
force-pushed from bdfd173 to 881926a
icebear0828 added a commit that referenced this pull request on May 5, 2026
The soak check measures `now - dev_HEAD_timestamp >= 24h`, which means every new merge into dev resets the clock. Under any non-trivial merge cadence, dev never satisfies the soak gate and master stagnates: PRs #437/#438/#439/#440/#442 all stacked on dev for a week with no promotion.

Add a `force_skip_soak` boolean input to workflow_dispatch (default false). The scheduled cron remains untouched and continues to enforce the 24h rule. Only manual triggers can bypass, and only when the operator explicitly sets the input to true — intended for sync-back / merge commits whose content has actually been on dev long enough but whose HEAD timestamp is misleadingly fresh.

Test plan: YAML syntax verified via js-yaml. Functional verification will be the next manual workflow_dispatch run with the input set.

Co-authored-by: icebear0828 <icebear0828@users.noreply.github.com>
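The gate condition can be sketched as below; the function and parameter names are illustrative, not taken from the workflow. It shows both why any fresh merge resets the window and why the bypass is scoped to manual runs only.

```typescript
const SOAK_MS = 24 * 3600 * 1000; // 24h soak window

// The window is measured from dev's HEAD commit timestamp, so every new
// merge into dev resets the clock. force_skip_soak is only honored on
// manual workflow_dispatch runs; scheduled runs always enforce the rule.
function soakGatePasses(
  devHeadTimestampMs: number,
  nowMs: number,
  trigger: "schedule" | "workflow_dispatch",
  forceSkipSoak: boolean,
): boolean {
  if (trigger === "workflow_dispatch" && forceSkipSoak) return true;
  return nowMs - devHeadTimestampMs >= SOAK_MS;
}
```

With a merge landed one hour ago, scheduled runs fail the gate regardless of the flag, while a manual run with the input set promotes anyway.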
icebear0828 added a commit that referenced this pull request on May 5, 2026
feat(proxy): WebSocket connection pool fixes prompt-cache hit-rate jitter (#440)

Squash of the four commits above: add WS connection pool module (no wiring yet); wire WS connection pool through ws-transport + proxy-handler; observability for WS pool decisions + CHANGELOG; wire ws_pool config into singleton (self-review of #440).

Co-authored-by: icebear0828 <icebear0828@users.noreply.github.com>
hangox added a commit to hangox/codex-proxy that referenced this pull request on May 7, 2026
Missed while merging upstream icebear0828#440 (WebSocket connection pool): on the streaming path, the empty-response account-switch retry calls retryEmptyResponseRequest without passing buildPoolCtx, so the post-switch createResponse bypasses the WS pool and the prompt-cache jitter persists for streaming clients. The non-streaming path already passes it correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
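The fix amounts to threading the same pool-context builder through the streaming retry that the non-streaming path already used. The signatures below are hypothetical simplifications of the real functions.

```typescript
// Simplified pool context: the real one carries more than the key.
type PoolCtx = { poolKey: string };

// Same key derivation the non-streaming path already used.
function buildPoolCtx(entryId: string, chainConversationId: string): PoolCtx {
  return { poolKey: `${entryId}:${chainConversationId}` };
}

// Before the fix, the streaming empty-response retry invoked createResponse
// with no poolCtx, so the post-switch request silently bypassed the WS pool.
function retryEmptyResponseRequest(
  entryId: string,
  chainConversationId: string,
  createResponse: (poolCtx?: PoolCtx) => string,
): string {
  return createResponse(buildPoolCtx(entryId, chainConversationId));
}
```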
Summary

Fixes the upstream prompt cache hit rate oscillating wildly between 5% and 99%. Root cause: codex-proxy called `new WebSocket(url)` for every upstream WS request, and OpenAI's WS gateway load-balances by hashing the connection ID, so the same logical conversation was randomly thrown onto different backends. Each backend had cached a different-length prefix, which is why `cached_tokens` kept jumping between discrete "checkpoints" such as 1920 / 2432 / 24448 / 40320 / 47488. Following the strategy of the real codex CLI (`WebsocketSession.connection` at core/src/client.rs:802), this PR introduces a persistent WS connection pool keyed per (entryId, conversationId): consecutive turns of the same conversation on the same account ride a single physical WS, the upstream LB pins to one backend, and the prompt cache reaches a stable hit rate.

Architecture

- src/proxy/ws-pool.ts (new): `PersistentWs` + `WsConnectionPool` + `WsReusedConnectionError`
  - Pool key `${entryId}:${conversationId}`, covering explicit and implicit conversation chaining
  - Cleanup on `max_age_ms` (55 min, closing ahead of the server's 60 min hard limit) and on account status changes (`evictByEntryId`)
  - `WsReusedConnectionError` → single fallback to a one-shot WS; a mid-stream close is surfaced to the client (no retry, partial data already delivered)
- account-pool.ts: cascades `evictByEntryId` on `markRateLimited` / `markStatus(non-active)` / `removeAccount` / `updateToken` (refresh), so an old WS carrying a stale access_token is never reused
- SIGTERM/SIGINT hooks additionally call `wsPool.shutdown()`
- src/proxy/ws-transport.ts: extracts `openOneShotWs` (the legacy path); `createWebSocketResponse` takes an optional `poolCtx?` parameter. With a ctx it first tries `pool.acquire`, otherwise it falls back to one-shot. Any pool/factory failure also falls back to one-shot and never leaks into callers
- src/proxy/codex-api.ts / src/routes/shared/proxy-handler.ts: pass `poolCtx` through; `buildPoolCtx()` generates the ctx on WS paths such as implicit resume / explicit prev_resp_id
- Config: `ws_pool: { enabled: true (default), max_age_ms: 3300000, max_per_account: 8 }`. On by default (bugfix in nature); rollback is `enabled: false` plus a restart

Observability

The entry log gains a `ws=` field:
- `ws=reuse:abc12345`: pool hit, connection reused
- `ws=new:def67890`: pool miss, new connection created
- `ws=bypass(busy|cap|dead|disabled|no_key|factory_error)`: bypassed to a one-shot WS
- `ws=retry-after-stale-reuse:abc12345`: stale reuse failed, automatic single retry

Together with the existing `rid/conv/key/prev/resume/hit` fields, pool behavior and hit-rate attribution can be read directly off the logs.

Test Plan

- `npm test`: 1702 passed (+1 skipped), zero regressions
- `npx tsc --noEmit`: clean
- `npm run build`: vite + tsc OK
- tests/unit/proxy/ws-pool.test.ts: 26 unit tests covering the full PersistentWs + WsConnectionPool lifecycle (acquire/release/idle/dead/cap/abort/server limit/upgrade headers/...)
- tests/integration/ws-pool-reuse.test.ts: spins up a local `ws.Server` and verifies `enabled: false` → degrades to one-shot ✓ and `evictByEntryId` → pool emptied + later rebuilt ✓
- ws-transport*.test.ts untouched and all passing (backwards compatible)

Notes

- 96339f1: pure addition of ws-pool.ts + 26 unit tests, no wiring, zero regression
- 5023709: wires ws-transport / proxy-handler / account-pool / index.ts + the integration test
- 06407e6: observability + CHANGELOG