Skip to content

fix(run_agent): prevent _create_openai_client from mutating caller's kwargs#10978

Closed
taeuk178 wants to merge 1 commit into
NousResearch:mainfrom
taeuk178:fix/http-client-kwargs-mutation
Closed

fix(run_agent): prevent _create_openai_client from mutating caller's kwargs#10978
taeuk178 wants to merge 1 commit into
NousResearch:mainfrom
taeuk178:fix/http-client-kwargs-mutation

Conversation

@taeuk178

Copy link
Copy Markdown
Contributor

Summary

  • Copy client_kwargs at entry of _create_openai_client so the function can't leak injected transport state back to callers (notably self._client_kwargs).
  • Add a regression test that pins the contract.

Why

This is preventive hardening, not a fix for a live bug. The class of bug it prevents was demonstrated by #10933 (reverted in e07dbde5), and the underlying issue it tried to solve (#10324 — connections hanging in CLOSE-WAIT) is still open, so another transport tweak inside this function is likely.

The regression pattern:

  1. __init__ stores self._client_kwargs = client_kwargs (line 1036), then calls self._create_openai_client(client_kwargs, ...) at line 1059. Same dict reference.
  2. _create_openai_client mutated client_kwargs by writing an httpx.Client into it (the block that fix: enable TCP keepalives to detect dead provider connections #10933 added, now reverted).
  3. After that first mutation, self._client_kwargs["http_client"] pointed to the same httpx.Client wrapped by self.client.
  4. On each request, _create_request_openai_client did request_kwargs = dict(self._client_kwargs) — a shallow copy that carried the http_client reference forward. _create_openai_client then saw "http_client" in client_kwargs and skipped creating a fresh one, so the request OpenAI client wrapped the same transport.
  5. _close_request_openai_client(request_client, reason="request_complete") ran after the request and closed that transport.
  6. The next request repeated step 4 — but now the transport was already closed. The request OpenAI client wrapped a dead httpx.Client, and every attempt raised APIConnectionError('Connection error.') with cause RuntimeError: Cannot send a request, as the client has been closed, through all 3 retries.

I hit this on Hermes v0.9.0 running against openai-codex:

ERROR root: API call failed after 3 retries. Connection error. | provider=openai-codex model=gpt-5.4 msgs=5 tokens=~6,288
DIAG: cause=RuntimeError:Cannot send a request, as the client has been closed. repr=APIConnectionError('Connection error.')

Single-turn chats (no tool calls → no second API call) worked. Any tool use (terminal, cron jobs, etc. → second API call required) failed every time. curl with the same request body succeeded, and the OpenAI Python SDK with the same body succeeded, which pointed at hermes-specific client state rather than the payload or endpoint.

Fix

client_kwargs = dict(client_kwargs) at the top of _create_openai_client. Any mutation further down now stays on a local copy, so callers' dicts are untouched regardless of what transport customization lands inside the function next.

Test plan

  • New test tests/run_agent/test_create_openai_client_kwargs_isolation.py — snapshots the input dict, calls _create_openai_client, asserts it's unchanged. Passes with the fix; fails against a version that reintroduces the fix: enable TCP keepalives to detect dead provider connections #10933 mutation.
  • pytest tests/run_agent/ — 785 passed on my machine. The only two failures I saw are environment-related (OPENROUTER_API_KEY not set in test env) and reproduce against main without this patch, so they're unrelated.

…kwargs

Copy client_kwargs at entry so the function can't leak injected transport
state back to callers. This is preventive hardening after NousResearch#10933 (reverted
in e07dbde) demonstrated how easy it is to poison self._client_kwargs: the
PR added an httpx.Client to the dict, which was then re-read on the next
request after the transport was torn down, yielding APIConnectionError with
cause "Cannot send a request, as the client has been closed" on every
retry. The underlying dead-connection issue (NousResearch#10324) is still open, so the
next transport/keepalive attempt will sit in the same codepath — the copy
locks the contract and a regression test pins it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kshitijk4poor pushed a commit that referenced this pull request Apr 16, 2026
…args

Shallow-copy client_kwargs at the top of _create_openai_client() to
prevent in-place mutation from leaking back into self._client_kwargs.
Defensive fix that locks the contract for future httpx/transport work.

Cherry-picked from #10978 by @taeuk178.
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Salvaged into #11056. Cherry-picked with authorship preserved onto current main. Test passes. Thanks @taeuk178!

kshitijk4poor pushed a commit that referenced this pull request Apr 16, 2026
…args

Shallow-copy client_kwargs at the top of _create_openai_client() to
prevent in-place mutation from leaking back into self._client_kwargs.
Defensive fix that locks the contract for future httpx/transport work.

Cherry-picked from #10978 by @taeuk178.
teknium1 added a commit that referenced this pull request Apr 16, 2026
Re-land of #10933, now guarded by the tests in #11266.

## The original problem (#10324)

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

## The fix

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

## Why this one doesn't regress like #10933 did

The original land (#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's #10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

## Validation

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the #10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes #10324
teknium1 added a commit that referenced this pull request Apr 17, 2026
Re-land of #10933, now guarded by the tests in #11266.

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

The original land (#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's #10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the #10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes #10324
teknium1 added a commit that referenced this pull request Apr 17, 2026
… (#11277)

Re-land of #10933, now guarded by the tests in #11266.

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

The original land (#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's #10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the #10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes #10324
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…args

Shallow-copy client_kwargs at the top of _create_openai_client() to
prevent in-place mutation from leaking back into self._client_kwargs.
Defensive fix that locks the contract for future httpx/transport work.

Cherry-picked from NousResearch#10978 by @taeuk178.
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
…esearch#10324) (NousResearch#11277)

Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266.

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's NousResearch#10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the NousResearch#10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes NousResearch#10324
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
…args

Shallow-copy client_kwargs at the top of _create_openai_client() to
prevent in-place mutation from leaking back into self._client_kwargs.
Defensive fix that locks the contract for future httpx/transport work.

Cherry-picked from NousResearch#10978 by @taeuk178.
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
…esearch#10324) (NousResearch#11277)

Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266.

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's NousResearch#10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the NousResearch#10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes NousResearch#10324
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…args

Shallow-copy client_kwargs at the top of _create_openai_client() to
prevent in-place mutation from leaking back into self._client_kwargs.
Defensive fix that locks the contract for future httpx/transport work.

Cherry-picked from NousResearch#10978 by @taeuk178.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…esearch#10324) (NousResearch#11277)

Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266.

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's NousResearch#10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the NousResearch#10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes NousResearch#10324
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…args

Shallow-copy client_kwargs at the top of _create_openai_client() to
prevent in-place mutation from leaking back into self._client_kwargs.
Defensive fix that locks the contract for future httpx/transport work.

Cherry-picked from NousResearch#10978 by @taeuk178.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…esearch#10324) (NousResearch#11277)

Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266.

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's NousResearch#10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the NousResearch#10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes NousResearch#10324
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…args

Shallow-copy client_kwargs at the top of _create_openai_client() to
prevent in-place mutation from leaking back into self._client_kwargs.
Defensive fix that locks the contract for future httpx/transport work.

Cherry-picked from NousResearch#10978 by @taeuk178.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…esearch#10324) (NousResearch#11277)

Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266.

When a provider drops a TCP connection mid-stream, the socket can enter
CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal
arrives, so the httpx read timeout never triggers and the agent hangs
indefinitely. The other defenses (''_force_close_tcp_sockets'', stale
stream detector) all ride on the socket layer reporting the dead
connection, which it never does without probes.

Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT''
into the httpx transport. Kernel probes after 30s idle, retries every
10s, gives up after 3 → dead peer detected within ~60s instead of
hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux,
''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere
the socket options aren't available.

The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it
injected the ''httpx.Client''. Since callers pass ''self._client_kwargs''
by reference, the injected client leaked into the instance state. After
the first request, the OpenAI SDK closed its ''http_client'' — including
the injected one. The next ''_create_openai_client'' call re-read the
now-closed ''httpx.Client'' from ''self._client_kwargs'' and every
subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError:
Cannot send a request, as the client has been closed'' (AlexKucera's
Discord report, 2026-04-16).

The defensive ''client_kwargs = dict(client_kwargs)'' copy already on
main (taeuk178's NousResearch#10978) means this injection only lands in the
per-call local copy. Each ''_create_openai_client'' invocation gets
its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired
''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild,
teardown, credential rotation), its ''httpx.Client'' closes with it
and the next call constructs a fresh one — no stale closed transport
can be reused.

Full 4-test matrix all green (unit + live with real OpenRouter round
trips, HERMES_LIVE_TESTS=1):

    tests/run_agent/test_create_openai_client_kwargs_isolation.py      PASS
    tests/run_agent/test_create_openai_client_reuse.py                 PASS (2)
    tests/run_agent/test_sequential_chats_live.py                      PASS

Socket options verified on the live httpx transport:

    _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)]
    = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3)

Sequential-chat reproduction of the NousResearch#10933 failure was explicitly
run against this patch — the defensive copy on main prevents the
closed transport from leaking back into ''self._client_kwargs'', so
every rebuild constructs a fresh transport.

Closes NousResearch#10324
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants