Skip to content

fix(proxy): close implicit-resume reverse mismatch + surface 4xx errors#450

Merged
icebear0828 merged 5 commits intodevfrom
fix/unanswered-tool-call-recovery
May 7, 2026
Merged

fix(proxy): close implicit-resume reverse mismatch + surface 4xx errors#450
icebear0828 merged 5 commits intodevfrom
fix/unanswered-tool-call-recovery

Conversation

@icebear0828
Copy link
Copy Markdown
Owner

Summary

evaluateImplicitResume 此前只做 forward 校验(新输入里每个 function_call_output.call_id 必须命中上一轮 stored function_call),漏了反向方向。当上一轮模型并发吐 N 个 tool_use 而客户端只回 N-1 个 tool_result 时,proxy 仍然带 previous_response_id 走 resume,上游用存的 context 触发 invalid_request_error: No tool output found for function call call_X。这条 400 在流式 catch 里只写一条 SSE error 事件、零日志,在非流式 collectErr 里又因为 translator 抛的是 plain Error、status 信息丢失,最终一律降级成 502。Claude Code 用户因此持续看到 4xx 报错而 proxy 日志里查不到任何 call_id 现场。

Changes

  • src/routes/shared/proxy-handler.tsevaluateImplicitResume 增加反向检查(reason: "unanswered_tool_calls"),并在请求路径上 warn 出缺失的 call_id;catch 块新增 isUnansweredFunctionCallError 分支,复用 prevRespNotFoundRetried 单次保护,strip previous_response_id + 完整重放
  • src/proxy/error-classification.ts — 新增 isUnansweredFunctionCallError(仅匹配 400 + "no tool output found for function call")
  • src/translation/codex-{anthropic,openai,gemini}.ts + 新文件 codex-api-error-from-event.ts — 上游 SSE error 事件改抛 CodexApiError(status, body),按 error code 映射 400/401/402/403/429(默认 502)
  • src/routes/shared/proxy-handler.ts (handleNonStreaming) — collectErr 加 instanceof CodexApiError 分支,透传上游真实 status 而不是走正则降 502
  • src/routes/shared/response-processor.ts — 流式 catch 新增 console.warn,把 status / msg / body 写进 dev-YYYY-MM-DD.log
  • tests/real/tools.test.ts/v1/chat/completions 两个 case 把 model:"codex" 改成 gpt-5.4(codex alias 在 2026-04-23 已被 config/models.yaml 注释里说明的那次提交移除,chat.ts 路径严格走 isRecognizedModelName 会 404)

Test Plan

  • npx vitest run tests/unit tests/integration — 1562/1562
  • npm run test:real -- unanswered-tool-call ×3 连跑全绿(每次 2/2,含 3 轮并行 tool_use 回归保护 + 缺 tool_result 走 4xx 透传分支)
  • npm run test:real -- tools — 8/8(修完后 /v1/chat/completions 两个 case 不再 404)
  • npx tsc --noEmit
  • 真实日志确认看到 resume=off:unanswered_tool_calls + upstream 400 during collect: ... call_X 双重证据

Notes

reverse check 只是 best-effort 防御——客户端真的少送一个 tool_result 时 proxy 没法替它补,最终还是会拿到上游 400,但现在客户端拿到的是带原文 message 的 400 而不是 502,proxy 日志里也有完整 call_id 现场可追。

…am errors

evaluateImplicitResume only checked that every function_call_output in the new
input mapped to a stored function_call (forward direction). When the previous
turn emitted N parallel tool_use blocks and the client only returned N-1
tool_results, the proxy still resumed with previous_response_id, and upstream
rejected with "No tool output found for function call call_X". The 400 was
silently dropped: the streaming path's catch only wrote an SSE error event
with no log, and the non-streaming collectErr path lost CodexApiError status
because the translator threw plain Error, so the status-regex fell back to
opaque 502.

- Add reverse check in evaluateImplicitResume → reason: unanswered_tool_calls,
  with a diagnostic warn naming the missing call_id.
- Add isUnansweredFunctionCallError classifier; proxy-handler catch strips
  previous_response_id + retries on the same account, mirroring the existing
  previous_response_not_found recovery.
- Translators now throw CodexApiError(status, body) via codexApiErrorFromEvent
  so handleNonStreaming surfaces the real status (400/429/etc.) instead of 502.
- streamResponse catch logs status + body excerpt so future occurrences leave
  evidence in dev-YYYY-MM-DD.log.
- Update tests/real/tools.test.ts /v1/chat/completions cases to use gpt-5.4
  after the codex alias was retired in 2026-04-23.
Review followups for #450:

- M1: handleNonStreaming.collectErr's instanceof CodexApiError branch now
  rethrows instead of returning c.json directly, so the outer
  handleProxyRequest catch runs unified strip+retry recovery for
  mid-SSE 400s (previous_response_not_found, unanswered_function_call).
  Releases the account before the throw; outer catch deduplicates via
  the released Set.
- M2: rename prevRespNotFoundRetried → stripAndRetryDone to reflect
  that the same flag now gates two recovery paths (prev_resp_not_found
  + unanswered_function_call), single-shot to avoid retry loops.
- L2: response-processor uses instanceof CodexApiError instead of
  duck-typed body/status access — strict typing, no foreign Error
  subclasses with same field names.
- L4: drop unused TIMEOUT import from real e2e.
- L6: tighten partial-tool_result assertion — must be 4xx with the
  upstream's verbatim "No tool output found for function call"
  message; 5xx now fails the test (catches future regressions where
  the SSE/collect path loses the real upstream status).
Discovered while re-reviewing #450: handleNonStreaming was releasing the
account at L833 before falling into the new CodexApiError rethrow branch.
After rethrow, outer catch's strip+retry path continues on the same entryId
without re-acquiring — but the slot was already returned to the pool, so
another concurrent request could grab it and race the retry.

Move the unconditional release below the rethrow branch. EmptyResponseError
exhausted + generic-fallback paths still release before returning their
respective Response. The released Set continues to guard double-release on
terminal paths owned by the outer catch.
…or log

Carryover from #450 review:

- L1: evaluateImplicitResume now returns missingCallIds / unansweredCallIds
  on rejection; handleProxyRequest reads those off the result instead of
  recomputing the same set difference inline.
- L5: introduce stripCodexErrorPrefix helper to remove the "Codex API error
  (NNN): " prefix when the warn already prints status= separately, so the
  log no longer reads "upstream 400 during collect: Codex API error (400):
  No tool output ..." (duplicated status).

L3 (status-code map shared with ws-pool.ts) deferred — larger refactor,
follow-up PR.
@icebear0828
Copy link
Copy Markdown
Owner Author

Review notes:

非流式路径这次修得比较完整,相关单测我本地跑过是通过的:

npx vitest run tests/unit/routes/shared/proxy-handler-implicit-resume.test.ts tests/unit/proxy/error-classification.test.ts

结果:33 tests passed。

还有一个边界建议补齐或在 PR 描述里明确:streaming translator 遇到 upstream evt.error 时,目前仍然是把错误转成普通 SSE 内容并 return,不会抛 CodexApiError。因此 streaming 请求遇到 No tool output found... 时不会进入新增的 strip previous_response_id + retry recovery,也不会触发 streamResponse catch 里的新日志。

涉及位置:

  • src/translation/codex-to-openai.ts
  • src/translation/codex-to-anthropic.ts
  • src/translation/codex-to-gemini.ts

如果这个 PR 的目标是完整 surface/recover 4xx,建议把 streaming path 也纳入测试和实现;如果只覆盖 non-streaming collect path,也建议在 PR 说明里写清楚边界。

@icebear0828 icebear0828 merged commit 38620ca into dev May 7, 2026
1 check passed
@icebear0828 icebear0828 deleted the fix/unanswered-tool-call-recovery branch May 7, 2026 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant