Bug: custom keepalive transport breaks chatgpt codex backend #12952

@liuhao1024

Summary

openai-codex requests to https://chatgpt.com/backend-api/codex can fail with:

APIConnectionError: Connection error.

The current keepalive transport injection inside run_agent.py::_create_openai_client() is safe for regular OpenAI-compatible endpoints, but it is not compatible with the ChatGPT Codex backend. When Hermes injects a custom httpx.HTTPTransport(socket_options=...), the TLS handshake to chatgpt.com is reset by peer before the request reaches application-level validation.

This is distinct from the earlier stale-client regression fixed in #11033 / #11070, and also distinct from the proxy-env regression addressed in #12008. The remaining bug is endpoint-specific: the Codex backend fails only on the custom keepalive transport path.

Affected versions / baseline

  • Local runtime version: Hermes Agent v0.10.0 (2026.4.16)
  • Comparison baseline: official tag v2026.4.16
  • Keepalive was introduced after that tag in:
    • 12b109b6640a573abf685d3c881cab2a9fc5c3aa, "fix: enable TCP keepalives to detect dead provider connections (#10324) (#10933)"
    • reverted in e07dbde582e6c80f80eb0d3040add8331832a87b
    • re-landed in 8c478983ed0ec5609212950d5044398dd4d27a5a, "fix: enable TCP keepalives to detect dead provider connections (#10324) (#11277)"
    • later refactored into _build_keepalive_http_client() in d393104bad62700d8e33de003502b3f74854151a

Symptoms

  • provider=openai-codex
  • model=gpt-5.4
  • base_url=https://chatgpt.com/backend-api/codex
  • Hermes retries with APIConnectionError: Connection error.
  • Switching away from the injected keepalive transport restores reachability immediately

Reproduction

A minimal httpx comparison shows the issue without needing the full gateway stack:

import json
import httpx
import socket
from pathlib import Path

obj = json.loads((Path.home() / ".hermes" / "auth.json").read_text())
tok = obj["providers"]["openai-codex"]["tokens"]["access_token"]
headers = {
    "Authorization": f"Bearer {tok}",
    "User-Agent": "codex_cli_rs/0.0.0 (Hermes Agent)",
    "originator": "codex_cli_rs",
    "Content-Type": "application/json",
}
url = "https://chatgpt.com/backend-api/codex/responses"
payload = {
    "model": "gpt-5.4",
    "instructions": "Say hello",
    "input": "hello",
    "stream": True,
    "store": False,
}

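# Baseline: httpx's default transport. A 400 here still proves the request
# reached the backend's application-level validation.
with httpx.Client(timeout=20.0, http2=False) as c:
    r = c.post(url, headers=headers, json=payload)
    print("default", r.status_code, r.text[:120])

# Same request through a transport that only adds SO_KEEPALIVE.
transport = httpx.HTTPTransport(socket_options=[
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
])
with httpx.Client(timeout=20.0, transport=transport, http2=False) as c:
    try:
        r = c.post(url, headers=headers, json=payload)
        print("custom", r.status_code, r.text[:120])
    except httpx.TransportError as exc:
        # Print the failure instead of a traceback so both paths are comparable.
        print("custom", repr(exc))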

Observed result locally:

default 400 {"detail":"Input must be a list"}
custom ConnectError('[Errno 54] Connection reset by peer')

The key point is that the default client reaches the backend and gets a normal protocol-validation response, while the keepalive transport fails during connection setup.
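When triaging similar reports, it can help to confirm that a failure really is a socket-level reset rather than some other connection error. The helper below is a minimal sketch using only the standard library. `is_connection_reset` is a hypothetical name, and the sketch assumes the HTTP client chains the original `OSError` through `__cause__` or `__context__`, which may not hold for every client version. Errno 54 is `ECONNRESET` on macOS; Linux reports the same condition as Errno 104, so comparing against `errno.ECONNRESET` is more portable than matching the number.

```python
import errno


def is_connection_reset(exc: BaseException) -> bool:
    """Return True if exc, or any exception chained beneath it, is an ECONNRESET.

    Hypothetical helper: assumes the underlying OSError is reachable through
    the standard exception chain.
    """
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        if isinstance(current, OSError) and current.errno == errno.ECONNRESET:
            return True
        # Follow explicit chaining ("raise ... from ...") first, then implicit.
        current = current.__cause__ or current.__context__
    return False
```

In the reproduction above, this could be called on the caught exception to distinguish a reset from, say, a DNS failure or a timeout.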

Root cause

_create_openai_client() currently injects a custom http_client whenever one is not already present:

if "http_client" not in client_kwargs:
    keepalive_http = self._build_keepalive_http_client()
    if keepalive_http is not None:
        client_kwargs["http_client"] = keepalive_http

That path works for most OpenAI-compatible endpoints, but chatgpt.com/backend-api/codex is stricter. With the keepalive socket options enabled, the connection is reset during the TLS handshake ([Errno 54] Connection reset by peer), so the request never reaches the backend.

Proposed fix

Skip keepalive http_client injection for the Codex backend and let the OpenAI SDK construct its default transport:

  • match provider == "openai-codex"
  • or match base_url.startswith("https://chatgpt.com/backend-api/codex")

This keeps the keepalive optimization for normal OpenAI-compatible endpoints while restoring compatibility for Codex.
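The proposed check could be sketched roughly as follows. This is not the actual patch: `is_codex_backend` and `maybe_inject_keepalive` are hypothetical names, and the keepalive builder is passed in as a plain callable so the sketch stays independent of Hermes internals.

```python
from typing import Any, Callable, Dict, Optional

# Prefix taken from the issue text; everything else here is illustrative.
CODEX_BASE_URL_PREFIX = "https://chatgpt.com/backend-api/codex"


def is_codex_backend(provider: Optional[str], base_url: Optional[str]) -> bool:
    """Return True when either proposed condition identifies the Codex backend."""
    if provider == "openai-codex":
        return True
    return bool(base_url) and base_url.startswith(CODEX_BASE_URL_PREFIX)


def maybe_inject_keepalive(
    client_kwargs: Dict[str, Any],
    provider: Optional[str],
    base_url: Optional[str],
    build_keepalive_http_client: Callable[[], Any],
) -> Dict[str, Any]:
    """Inject a keepalive http_client unless one is already set or the target is Codex."""
    if "http_client" in client_kwargs or is_codex_backend(provider, base_url):
        # Leave kwargs untouched so the SDK constructs its default transport.
        return client_kwargs
    keepalive_http = build_keepalive_http_client()
    if keepalive_http is not None:
        client_kwargs["http_client"] = keepalive_http
    return client_kwargs
```

Checking both the provider name and the URL prefix covers configurations where a custom provider label points at the Codex URL.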

Why this is safe

  • The change is narrowly scoped to the Codex backend only.
  • Existing keepalive behavior remains unchanged for standard OpenAI-compatible APIs.
  • Proxy-related behavior for non-Codex endpoints remains covered by the tests from #12008 ("fix(codex): preserve proxy env when creating OpenAI client").
  • New regression tests can pin both behaviors:
    • non-Codex endpoints still inject a keepalive/proxy-aware client
    • Codex endpoints do not inject a custom client
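The two regression tests listed above might look something like the sketch below. It does not use the real Hermes code: `create_client_kwargs` is a hypothetical stand-in for the injection step in `_create_openai_client()`, and the keepalive builder is replaced with a mock so the tests can also assert whether it was called.

```python
import unittest
from unittest import mock


def create_client_kwargs(provider, base_url, build_keepalive):
    """Hypothetical stand-in for the injection step in _create_openai_client()."""
    kwargs = {"base_url": base_url}
    is_codex = provider == "openai-codex" or base_url.startswith(
        "https://chatgpt.com/backend-api/codex"
    )
    if not is_codex:
        http_client = build_keepalive()
        if http_client is not None:
            kwargs["http_client"] = http_client
    return kwargs


class KeepaliveInjectionTests(unittest.TestCase):
    def test_non_codex_endpoint_injects_keepalive_client(self):
        build = mock.Mock(return_value="keepalive-client")
        kwargs = create_client_kwargs("openai", "https://api.openai.com/v1", build)
        self.assertEqual(kwargs["http_client"], "keepalive-client")
        build.assert_called_once_with()

    def test_codex_endpoint_uses_sdk_default_transport(self):
        build = mock.Mock(return_value="keepalive-client")
        kwargs = create_client_kwargs(
            "openai-codex", "https://chatgpt.com/backend-api/codex", build
        )
        # No custom client is injected, and the builder is never invoked.
        self.assertNotIn("http_client", kwargs)
        build.assert_not_called()
```

Saved to a file, these can be run with `python -m unittest <module>`. Real tests would target the actual client factory rather than the stand-in.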

Related

I have a local patch + regression tests ready and can open a draft PR that references this issue.

Labels

  • P2: Medium (degraded but workaround exists)
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • provider/openai: OpenAI / Codex Responses API
  • type/bug: Something isn't working
