Skip to content

fix(codex): avoid custom keepalive transport on chatgpt backend#12953

Open
liuhao1024 wants to merge 5 commits into
NousResearch:mainfrom
liuhao1024:fix/codex-skip-keepalive-transport
Open

fix(codex): avoid custom keepalive transport on chatgpt backend#12953
liuhao1024 wants to merge 5 commits into
NousResearch:mainfrom
liuhao1024:fix/codex-skip-keepalive-transport

Conversation

@liuhao1024

@liuhao1024 liuhao1024 commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This draft fixes a Codex-specific transport regression in _create_openai_client().

Hermes currently injects a custom httpx.HTTPTransport(socket_options=...) to enable TCP keepalives. That path is desirable for regular OpenAI-compatible endpoints, but it is not compatible with the ChatGPT Codex backend at https://chatgpt.com/backend-api/codex: the TLS handshake gets reset before the request reaches normal API validation.

This PR keeps the existing keepalive behavior for normal OpenAI-compatible endpoints, but skips custom transport injection for the Codex backend.

Closes #12952

Related Issue

N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • See commit messages for detailed changes

How to Test

  1. Run pytest tests/ -q — all tests should pass
  2. Verify the specific scenario described above is resolved

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 26.4.1

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture and workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A

@liuhao1024

Copy link
Copy Markdown
Contributor Author

Maintainer note:

This is intentionally not a duplicate of the earlier transport regressions.

So the fix here is narrower: keep the current keepalive path for normal OpenAI-compatible endpoints, but bypass it for the Codex backend only.

@liuhao1024 liuhao1024 marked this pull request as ready for review April 20, 2026 10:03
@liuhao1024 liuhao1024 changed the title fix(codex): skip keepalive transport for chatgpt backend fix(codex): avoid custom keepalive transport on chatgpt backend Apr 20, 2026
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder provider/openai OpenAI / Codex Responses API labels Apr 22, 2026
@ratacat

ratacat commented Apr 25, 2026

Copy link
Copy Markdown

Supportive +1 from a real cron incident on macOS/launchd using openai-codex.

This PR looks like the right shape to me: skip the custom keepalive transport entirely for the ChatGPT Codex backend rather than trying to make the socket-options path work for an endpoint that appears sensitive to it. In our incident, a Codex cron run appeared to stream final output successfully but then the cron runner never returned/saved/released scheduler state. Avoiding the custom transport for Codex is a cleaner mitigation than piling more special-case timeout handling onto that transport path.

Small elegance suggestions, non-blocking:

  1. Consider extracting the Codex-backend test into a tiny helper, e.g. _is_chatgpt_codex_backend(provider, base_url), so the provider/base_url rule is named and reusable if other client construction paths need the same bypass later.

  2. It may be worth adding one assertion that regular OpenAI-compatible endpoints still pass configured timeout/proxy behavior through when the custom client is injected. This PR correctly focuses on Codex, but the broader class of bugs here is “custom http_client subtly changes OpenAI SDK semantics,” so pinning timeout preservation alongside proxy preservation would make the transport boundary more robust.

Overall: I think this is the right fix for #12952 and likely addresses the provider-side cause of the cron wedge we saw.

Address review suggestion to centralize the Codex backend detection logic
into _is_chatgpt_codex_backend() for reuse across client construction
paths. Improves maintainability and readability.

The helper checks both provider ('openai-codex') and base_url
(https://chatgpt.com/backend-api/codex) to identify the ChatGPT
Codex backend, which is incompatible with custom socket_options
in HTTPTransport.

Fixes: NousResearch#12952
Add regression test to ensure timeout is forwarded correctly when
the custom keepalive http_client is injected for regular OpenAI
endpoints. This addresses reviewer feedback that the broader
class of bugs is "custom http_client subtly changes OpenAI SDK
semantics" and pinning timeout preservation makes the transport
boundary more robust.

The test verifies that a configured timeout (300.0s) is passed
through to the OpenAI client constructor even when http_client
is injected, confirming that custom transport injection does not
silently drop timeout configuration.
Problem: PR NousResearch#12953 conflicted with upstream main in the OpenAI client keepalive/proxy construction path and its regression tests.

Resolution: keep main's base_url-aware NO_PROXY handling while skipping the custom keepalive http_client for the ChatGPT Codex backend; merge the regression coverage for regular OpenAI, Codex, NO_PROXY, and timeout behavior.
@liuhao1024 liuhao1024 force-pushed the fix/codex-skip-keepalive-transport branch from 2f3bcae to c7c0cf8 Compare April 28, 2026 03:21
@liuhao1024

Copy link
Copy Markdown
Contributor Author

@ratacat thanks for the suggestions. I updated the PR to address both points:

  • Extracted _is_chatgpt_codex_backend(provider, base_url) and use it to gate the ChatGPT Codex keepalive bypass.
  • Added regression coverage to ensure regular OpenAI-compatible endpoints still preserve proxy behavior and configured timeout when the custom keepalive client is injected.

I also synced with upstream main and resolved the merge conflicts.

Local targeted verification:
scripts/run_tests.sh tests/run_agent/test_create_openai_client_proxy_env.py tests/run_agent/test_create_openai_client_kwargs_isolation.py tests/run_agent/test_create_openai_client_reuse.py

Result: 15 passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists provider/openai OpenAI / Codex Responses API type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: custom keepalive transport breaks chatgpt codex backend

3 participants