Summary
A regression in the TCP keepalive change causes long-running gateway sessions to reuse a mutated http_client stored inside self._client_kwargs, so later request clients are not actually fresh. In practice this leaves stale sockets in the shared pool and leads to repeated APIConnectionError / APITimeoutError on openai-codex, even for tiny prompts.
Related
- the TCP keepalive change (fix: enable TCP keepalives to detect dead provider connections)
Symptoms
- hermes chat -q ... can succeed in a fresh shell
- long-running hermes gateway run --replace sessions start failing with Connection error.
- failures happen even on tiny contexts (for example 2-7 messages / ~5k-10k tokens)
- gateway processes show lingering CLOSE_WAIT sockets until restart
Root Cause
run_agent.py::_create_openai_client() mutates the passed client_kwargs in-place by inserting a concrete http_client. In gateway mode, self._client_kwargs is retained across turns and reused by _create_request_openai_client(), so future "fresh" request clients silently reuse the same underlying httpx.Client / transport pool.
That defeats the intended per-request isolation and lets stale sockets survive across retries/turns.
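The aliasing can be shown in isolation with a minimal sketch. It does not use the real run_agent.py code: FakeHttpClient stands in for httpx.Client, create_client_buggy is a hypothetical reduction of _create_openai_client(), and the "only insert when missing" check is an assumption about how the original function decides to build the client.

```python
class FakeHttpClient:
    """Stand-in for httpx.Client; one instance represents one connection pool."""


def create_client_buggy(client_kwargs: dict) -> dict:
    # Hypothetical reduction of the buggy behavior: the concrete http_client
    # is written into the dict that the caller still owns.
    if "http_client" not in client_kwargs:
        client_kwargs["http_client"] = FakeHttpClient()
    return {"http_client": client_kwargs["http_client"]}


# Plays the role of self._client_kwargs, retained across gateway turns.
shared_kwargs = {"api_key": "sk-placeholder"}

shared = create_client_buggy(shared_kwargs)
request_1 = create_client_buggy(shared_kwargs)
request_2 = create_client_buggy(shared_kwargs)

# The caller's dict now carries the pool created for the shared client ...
print("http_client" in shared_kwargs)  # True
# ... so every later "fresh" request client ends up on that same pool.
print(request_1["http_client"] is shared["http_client"])  # True
print(request_2["http_client"] is shared["http_client"])  # True
```

Once the first call has run, the retained dict is no longer a description of how to build a client; it holds one specific client, and every consumer of the dict inherits it.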
Reproduction
- Run Hermes via gateway with openai-codex
- Send a few messages over time
- Observe repeated retries ending in APIConnectionError / APITimeoutError
- Inspect the process with lsof and note lingering CLOSE_WAIT sockets
- In a fresh shell, replay the same payload or run hermes chat -q ... and observe success
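For the lsof step, a small helper along these lines can make the CLOSE_WAIT count easier to track over time. It is only a sketch: count_close_wait and close_wait_for_pid are hypothetical names, the sample output is made up for illustration (documentation-range addresses), and the parsing assumes the usual lsof layout where the TCP state appears in parentheses at the end of the line.

```python
import subprocess


def count_close_wait(lsof_output: str) -> int:
    """Count lines whose trailing TCP state is CLOSE_WAIT."""
    return sum(
        1
        for line in lsof_output.splitlines()
        if line.rstrip().endswith("(CLOSE_WAIT)")
    )


def close_wait_for_pid(pid: int) -> int:
    """Run lsof for one process and count its CLOSE_WAIT TCP sockets."""
    # -n / -P skip host and port name lookups; -a ANDs the -iTCP and -p filters.
    # lsof exits non-zero when nothing matches, so the exit code is not checked.
    result = subprocess.run(
        ["lsof", "-nP", "-a", "-iTCP", "-p", str(pid)],
        capture_output=True,
        text=True,
        check=False,
    )
    return count_close_wait(result.stdout)


# Illustrative sample only, not captured from a real gateway process.
SAMPLE = """\
COMMAND  PID USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
python  4242 me   12u  IPv4 0xabc       0t0  TCP 10.0.0.5:51514->203.0.113.7:443 (CLOSE_WAIT)
python  4242 me   13u  IPv4 0xabd       0t0  TCP 10.0.0.5:51520->203.0.113.7:443 (ESTABLISHED)
python  4242 me   14u  IPv4 0xabe       0t0  TCP 10.0.0.5:51522->203.0.113.7:443 (CLOSE_WAIT)
"""

print(count_close_wait(SAMPLE))  # 2
```

Calling close_wait_for_pid(<gateway pid>) before and after a few failing turns should show whether the count grows.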
Expected
Each request client should get a fresh httpx.Client / transport pool, while the shared client keeps its own lifecycle.
Proposed Fix
- copy client_kwargs at the start of _create_openai_client() instead of mutating the caller-owned dict
- keep the keepalive http_client, but preserve OpenAI SDK default timeout / limits
- add regression tests proving:
  - _client_kwargs is not mutated
  - request clients do not reuse the shared client's http_client
  - keepalive client keeps OpenAI default timeout instead of httpx's 5s default
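A rough sketch of the copy-first approach and the first two regression checks, under the same assumptions as a toy model: FakeHttpClient stands in for httpx.Client and create_client_fixed is a hypothetical reduction of _create_openai_client(), not the actual patch. The timeout / limits part of the fix is not modeled here.

```python
class FakeHttpClient:
    """Stand-in for httpx.Client; one instance represents one connection pool."""


def create_client_fixed(client_kwargs: dict) -> dict:
    # Work on a shallow copy so the caller-owned dict is never modified.
    # A shallow copy is enough here because only a top-level key is added.
    kwargs = dict(client_kwargs)
    if "http_client" not in kwargs:
        kwargs["http_client"] = FakeHttpClient()
    return kwargs


# Plays the role of self._client_kwargs, retained across gateway turns.
shared_kwargs = {"api_key": "sk-placeholder"}
snapshot = dict(shared_kwargs)

shared = create_client_fixed(shared_kwargs)
request_1 = create_client_fixed(shared_kwargs)
request_2 = create_client_fixed(shared_kwargs)

# Check 1: the retained kwargs are unchanged.
print(shared_kwargs == snapshot)  # True
# Check 2: request clients do not reuse the shared client's pool, or each other's.
print(request_1["http_client"] is not shared["http_client"])  # True
print(request_2["http_client"] is not request_1["http_client"])  # True
```

Copying at the top of the function keeps the fix local: callers such as _create_request_openai_client() can keep passing self._client_kwargs without needing to know the function adds keys.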
Validation
Targeted tests:
pytest tests/run_agent/test_openai_client_lifecycle.py -q
Manual checks performed on macOS:
- direct hermes chat -q works for both default and named profile
- threaded / gateway-like probes work
- long-running gateways recover after restart and stop showing stale CLOSE_WAIT buildup