fix(agent): avoid reusing mutated http client kwargs#11072

Closed
swnb wants to merge 1 commit into NousResearch:main from swnb:fix/gateway-openai-client-kwargs

Conversation

swnb commented Apr 16, 2026

Summary

  • avoid mutating caller-owned client_kwargs in _create_openai_client()
  • preserve OpenAI SDK timeout/connection limits when injecting keepalive http_client
  • add regression coverage for timeout preservation and per-request client isolation

Root Cause

The keepalive change introduced a subtle gateway regression:

  • _create_openai_client() inserted a concrete http_client into the passed client_kwargs
  • gateway sessions retain self._client_kwargs across turns
  • later request clients silently reused the same underlying httpx.Client / transport pool instead of getting a fresh one
  • stale sockets then survived across retries/turns and produced repeated APIConnectionError / APITimeoutError

This was especially visible in long-running gateway processes, while fresh-shell hermes chat -q ... calls could still succeed.
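
The mutation pattern described above can be sketched in a few lines. This is a minimal illustration, not the project's code: `FakeHttpClient` is a hypothetical stand-in for `httpx.Client`, `create_client_buggy` is an assumed simplification of `_create_openai_client()`, and the returned dict stands in for the constructed OpenAI client.

```python
class FakeHttpClient:
    """Hypothetical stand-in for httpx.Client; one instance models one connection pool."""


def create_client_buggy(client_kwargs: dict) -> dict:
    # Bug pattern: the concrete http_client is written into the caller's dict.
    if "http_client" not in client_kwargs:
        client_kwargs["http_client"] = FakeHttpClient()
    # A copy of the kwargs stands in for OpenAI(**client_kwargs).
    return dict(client_kwargs)


# The gateway session keeps this dict alive across turns (placeholder values).
session_kwargs = {"api_key": "sk-example", "base_url": "https://example.invalid/v1"}

first = create_client_buggy(session_kwargs)
second = create_client_buggy(session_kwargs)

# The retained kwargs now carry a concrete client...
leaked = "http_client" in session_kwargs
# ...so every later "fresh" client silently shares the same pool.
shared_pool = first["http_client"] is second["http_client"]
```

Under these assumptions both `leaked` and `shared_pool` end up true, which matches the symptom of stale sockets surviving across turns in a long-running process while a fresh process starts clean.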

Fix

  1. Copy client_kwargs on entry so the caller-owned dict is never mutated
  2. Keep the TCP keepalive transport, but explicitly preserve OpenAI SDK default timeout / connection limits
  3. Add regression tests to ensure:
    • _client_kwargs is not mutated
    • request clients do not reuse the shared client's http_client
    • keepalive clients do not fall back to httpx's 5s default timeout
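
A rough sketch of the copy-on-entry approach, together with the kind of assertions the regression tests might make. Everything here is illustrative: `FakeHttpClient` is a hypothetical stand-in for `httpx.Client`, `create_client_fixed` is an assumed simplification of the patched function, and `ASSUMED_SDK_TIMEOUT_S` is a placeholder value rather than something taken from the PR.

```python
ASSUMED_SDK_TIMEOUT_S = 600.0  # assumption: placeholder for the SDK's default timeout
HTTPX_DEFAULT_TIMEOUT_S = 5.0  # the httpx default the fix must not fall back to


class FakeHttpClient:
    """Hypothetical stand-in for httpx.Client that only records its timeout."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout


def create_client_fixed(client_kwargs: dict) -> dict:
    # Step 1: shallow-copy so the caller-owned dict is never mutated.
    kwargs = dict(client_kwargs)
    if "http_client" not in kwargs:
        # Step 2: pass the intended timeout explicitly instead of relying on
        # the transport library's own default.
        timeout = kwargs.get("timeout", ASSUMED_SDK_TIMEOUT_S)
        kwargs["http_client"] = FakeHttpClient(timeout=timeout)
    return kwargs  # stands in for OpenAI(**kwargs)


session_kwargs = {"api_key": "sk-example"}
snapshot = dict(session_kwargs)

client_a = create_client_fixed(session_kwargs)
client_b = create_client_fixed(session_kwargs)

# Step 3: the properties the regression tests are meant to pin down.
kwargs_untouched = session_kwargs == snapshot
isolated = client_a["http_client"] is not client_b["http_client"]
timeout_preserved = client_a["http_client"].timeout != HTTPX_DEFAULT_TIMEOUT_S
```

A shallow copy is enough in this sketch because only a top-level key is added; nested values in the caller's dict are shared but never modified.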

How To Reproduce

Before this fix:

  1. run Hermes through gateway with openai-codex
  2. keep the gateway alive across multiple turns / retries
  3. observe repeated "Connection error." / "Request timed out." failures, even for tiny prompts
  4. inspect the daemon with lsof and note lingering stale sockets / pool reuse behavior
  5. compare with a fresh-shell hermes chat -q ..., which can still succeed

Validation

Tests:

  • pytest tests/run_agent/test_openai_client_lifecycle.py -q

Manual checks on macOS:

  • hermes chat -q succeeds for both default and named profile
  • threaded / gateway-like probes succeed
  • gateway logs for the default and ferriscluster profiles show normal responses again after restart

Platforms Tested

  • macOS

Closes #11070

swnb commented Apr 16, 2026


Closing this as a duplicate / superseded by #11056. I hit the same root cause locally while debugging gateway regressions on macOS, and the detailed reproduction / validation notes remain in #11070.

Development

Successfully merging this pull request may close these issues.

Regression: gateway reuses mutated OpenAI http_client kwargs and accumulates stale connections