Regression: gateway reuses mutated OpenAI http_client kwargs and accumulates stale connections #11070

@swnb

Summary

A regression in the TCP keepalive change causes long-running gateway sessions to reuse a mutated http_client stored inside self._client_kwargs, so later request clients are not actually fresh. In practice this leaves stale sockets in the shared pool and leads to repeated APIConnectionError / APITimeoutError on openai-codex, even for tiny prompts.

Related

Symptoms

  • hermes chat -q ... can succeed in a fresh shell
  • long-running hermes gateway run --replace sessions start failing with Connection error.
  • failures happen even on tiny contexts (for example 2-7 messages / ~5k-10k tokens)
  • gateway processes show lingering CLOSE_WAIT sockets until restart

Root Cause

run_agent.py::_create_openai_client() mutates the passed client_kwargs in-place by inserting a concrete http_client. In gateway mode, self._client_kwargs is retained across turns and reused by _create_request_openai_client(), so future "fresh" request clients silently reuse the same underlying httpx.Client / transport pool.

That defeats the intended per-request isolation and lets stale sockets survive across retries/turns.
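The aliasing can be sketched in isolation. This is a minimal illustration, not the actual Hermes code: FakeHttpClient and FakeOpenAI are hypothetical stand-ins for httpx.Client and the OpenAI client, and the guard that skips insertion when an http_client is already present is an assumption about how the reuse comes about.

```python
class FakeHttpClient:
    """Hypothetical stand-in for httpx.Client; one instance = one connection pool."""


class FakeOpenAI:
    """Hypothetical stand-in for the OpenAI client; it just records its http_client."""

    def __init__(self, **kwargs):
        self.http_client = kwargs.get("http_client")


def create_client_buggy(client_kwargs: dict) -> FakeOpenAI:
    # Models the reported bug: the caller-owned dict is modified in place.
    # Assumption: insertion is skipped when an http_client is already present.
    if "http_client" not in client_kwargs:
        client_kwargs["http_client"] = FakeHttpClient()
    return FakeOpenAI(**client_kwargs)


# Plays the role of self._client_kwargs, retained across gateway turns.
retained_kwargs = {"api_key": "sk-example"}

shared = create_client_buggy(retained_kwargs)     # shared client, first turn
request_1 = create_client_buggy(retained_kwargs)  # supposedly fresh request client
request_2 = create_client_buggy(retained_kwargs)  # another "fresh" request client

# The retained dict now carries a concrete http_client ...
leaked = "http_client" in retained_kwargs
# ... so every later client ends up on the same pool as the shared client.
same_pool = (
    request_1.http_client is shared.http_client
    and request_2.http_client is shared.http_client
)
print(leaked, same_pool)
```

Once the first call has written http_client into the retained dict, every later call inherits it, which is why a stale socket in that pool keeps affecting subsequent turns.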

Reproduction

  1. Run Hermes via gateway with openai-codex
  2. Send a few messages over time
  3. Observe repeated retries ending in APIConnectionError / APITimeoutError
  4. Inspect the process with lsof and note lingering CLOSE_WAIT sockets
  5. In a fresh shell, replay the same payload or run hermes chat -q ... and observe success
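For step 4, a small helper along these lines can count CLOSE_WAIT entries in lsof output. It is a sketch under assumptions: the function names are hypothetical, the lsof flags shown are the common ones for listing a process's TCP sockets, and the sample text is illustrative rather than captured from a real gateway process.

```python
import subprocess


def lsof_tcp(pid: int) -> str:
    """Return lsof's listing of TCP sockets for one process (requires lsof)."""
    result = subprocess.run(
        ["lsof", "-nP", "-a", "-p", str(pid), "-iTCP"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def count_close_wait(lsof_output: str) -> int:
    """Count lines whose TCP state column reads (CLOSE_WAIT)."""
    return sum(
        1
        for line in lsof_output.splitlines()
        if line.rstrip().endswith("(CLOSE_WAIT)")
    )


# Illustrative sample only; addresses and PIDs are made up.
sample = """\
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python  12345 user   12u  IPv4 0x1234      0t0  TCP 192.0.2.10:54012->198.51.100.7:443 (CLOSE_WAIT)
python  12345 user   13u  IPv4 0x1235      0t0  TCP 192.0.2.10:54013->198.51.100.7:443 (ESTABLISHED)
python  12345 user   14u  IPv4 0x1236      0t0  TCP 192.0.2.10:54014->198.51.100.7:443 (CLOSE_WAIT)
"""

close_wait_count = count_close_wait(sample)
print(close_wait_count)
```

Running count_close_wait(lsof_tcp(pid)) against the gateway process at intervals should show whether the count grows between turns.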

Expected

Each request client should get a fresh httpx.Client / transport pool, while the shared client keeps its own lifecycle.

Proposed Fix

  • copy client_kwargs at the start of _create_openai_client() instead of mutating the caller-owned dict
  • keep the keepalive http_client, but build it with the OpenAI SDK's default timeout and connection limits rather than httpx's defaults
  • add regression tests proving:
    • _client_kwargs is not mutated
    • request clients do not reuse the shared client's http_client
    • keepalive client keeps OpenAI default timeout instead of httpx's 5s default
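The copy-first approach and the three regression checks could look roughly like this. It is a sketch with hypothetical stand-ins (FakeHttpClient, FakeOpenAI, create_client_fixed), not the real run_agent.py code; the 600-second value reflects my understanding of the OpenAI Python SDK's default overall timeout and should be verified against the installed SDK version.

```python
HTTPX_DEFAULT_TIMEOUT_S = 5.0   # httpx's default timeout, per the issue text
SDK_DEFAULT_TIMEOUT_S = 600.0   # assumed OpenAI SDK default; verify before relying on it


class FakeHttpClient:
    """Hypothetical stand-in for httpx.Client with keepalive configured."""

    def __init__(self, timeout: float = HTTPX_DEFAULT_TIMEOUT_S):
        self.timeout = timeout


class FakeOpenAI:
    """Hypothetical stand-in for the OpenAI client."""

    def __init__(self, **kwargs):
        self.http_client = kwargs.get("http_client")


def create_client_fixed(client_kwargs: dict) -> FakeOpenAI:
    # Work on a shallow copy so the caller-owned dict is never modified.
    kwargs = dict(client_kwargs)
    if "http_client" not in kwargs:
        # Pass the SDK default explicitly so the keepalive client does not
        # silently fall back to httpx's shorter default timeout.
        kwargs["http_client"] = FakeHttpClient(timeout=SDK_DEFAULT_TIMEOUT_S)
    return FakeOpenAI(**kwargs)


retained_kwargs = {"api_key": "sk-example"}  # plays the role of self._client_kwargs
before = dict(retained_kwargs)

shared = create_client_fixed(retained_kwargs)
request = create_client_fixed(retained_kwargs)

# 1. _client_kwargs is not mutated
kwargs_untouched = retained_kwargs == before
# 2. request clients do not reuse the shared client's http_client
distinct_pools = request.http_client is not shared.http_client
# 3. the keepalive client keeps the SDK default timeout
timeout_preserved = request.http_client.timeout == SDK_DEFAULT_TIMEOUT_S
print(kwargs_untouched, distinct_pools, timeout_preserved)
```

A shallow copy is enough here because only the top-level http_client key is added; nested values in the retained dict are still shared, which is fine as long as they are not modified.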

Validation

Targeted tests:

  • pytest tests/run_agent/test_openai_client_lifecycle.py -q

Manual checks performed on macOS:

  • direct hermes chat -q works for both the default profile and a named profile
  • threaded / gateway-like probes work
  • long-running gateways recover after restart and stop showing stale CLOSE_WAIT buildup

Labels

  • P1 (High: major feature broken, no workaround)
  • comp/agent (Core agent loop, run_agent.py, prompt builder)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • sweeper:implemented-on-main (Sweeper: behavior already present on current main)
  • type/bug (Something isn't working)
