fix: enable TCP keepalives to detect dead provider connections#10933
Merged
Conversation
When a custom provider drops a connection mid-stream, the TCP socket can enter CLOSE-WAIT and the httpx read timeout may never fire — epoll_wait blocks indefinitely because no data or error signal arrives. The agent hangs until manually killed. The existing defenses (httpx read timeout, stale stream detector, _force_close_tcp_sockets) are all time-based and work correctly once triggered, but they rely on the socket layer reporting the dead connection. Without TCP keepalives, the kernel has no reason to probe a silent connection. Fix: inject SO_KEEPALIVE + TCP_KEEPIDLE/KEEPINTVL/KEEPCNT into the httpx transport via socket_options. The kernel probes idle connections after 30s, retries every 10s, gives up after 3 failures — dead peer detected within ~60s instead of hanging forever. Platform-aware: uses TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS. Falls back silently if socket options aren't available (Windows, etc.). Closes #10324
taeuk178
added a commit
to taeuk178/hermes-agent
that referenced
this pull request
Apr 16, 2026
…kwargs Copy client_kwargs at entry so the function can't leak injected transport state back to callers. This is preventive hardening after NousResearch#10933 (reverted in e07dbde) demonstrated how easy it is to poison self._client_kwargs: the PR added an httpx.Client to the dict, which was then re-read on the next request after the transport was torn down, yielding APIConnectionError with cause "Cannot send a request, as the client has been closed" on every retry. The underlying dead-connection issue (NousResearch#10324) is still open, so the next transport/keepalive attempt will sit in the same codepath — the copy locks the contract and a regression test pins it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2 tasks
teknium1
added a commit
that referenced
this pull request
Apr 16, 2026
Two new tests in tests/run_agent/ that pin the user-visible invariant behind AlexKucera's Discord report (2026-04-16): no matter how a future keepalive / transport fix for #10324 plumbs sockets in, sequential chats on the same AIAgent instance must all succeed. test_create_openai_client_reuse.py (no network, runs in CI): - test_second_create_does_not_wrap_closed_transport_from_first back-to-back _create_openai_client calls must not hand the same http_client (after an SDK close) to the second construction - test_replace_primary_openai_client_survives_repeated_rebuilds three sequential rebuilds via the real _replace_primary_openai_client entrypoint must each install a live client test_sequential_chats_live.py (opt-in, HERMES_LIVE_TESTS=1): - test_three_sequential_chats_across_client_rebuild real OpenRouter round trips, with an explicit _replace_primary_openai_client call between turns 2 and 3. Error-sentinel detector treats 'API call failed after 3 retries' replies as failures instead of letting them pass the naive truthy check (which is how a first draft of this test missed the bug it was meant to catch). Validation: clean main (post-revert, defensive copy present) -> all 4 tests PASS broken #10933 state (keepalive injection, no defensive copy) -> all 4 tests FAIL with precise messages pointing at #10933 Companion to taeuk178's test_create_openai_client_kwargs_isolation.py, which pins the syntactic 'don't mutate input dict' half of the same contract. Together they catch both the specific mechanism of #10933 and any other reimplementation that breaks the sequential-call invariant.
teknium1
added a commit
that referenced
this pull request
Apr 16, 2026
#11266) Two new tests in tests/run_agent/ that pin the user-visible invariant behind AlexKucera's Discord report (2026-04-16): no matter how a future keepalive / transport fix for #10324 plumbs sockets in, sequential chats on the same AIAgent instance must all succeed. test_create_openai_client_reuse.py (no network, runs in CI): - test_second_create_does_not_wrap_closed_transport_from_first back-to-back _create_openai_client calls must not hand the same http_client (after an SDK close) to the second construction - test_replace_primary_openai_client_survives_repeated_rebuilds three sequential rebuilds via the real _replace_primary_openai_client entrypoint must each install a live client test_sequential_chats_live.py (opt-in, HERMES_LIVE_TESTS=1): - test_three_sequential_chats_across_client_rebuild real OpenRouter round trips, with an explicit _replace_primary_openai_client call between turns 2 and 3. Error-sentinel detector treats 'API call failed after 3 retries' replies as failures instead of letting them pass the naive truthy check (which is how a first draft of this test missed the bug it was meant to catch). Validation: clean main (post-revert, defensive copy present) -> all 4 tests PASS broken #10933 state (keepalive injection, no defensive copy) -> all 4 tests FAIL with precise messages pointing at #10933 Companion to taeuk178's test_create_openai_client_kwargs_isolation.py, which pins the syntactic 'don't mutate input dict' half of the same contract. Together they catch both the specific mechanism of #10933 and any other reimplementation that breaks the sequential-call invariant.
teknium1
added a commit
that referenced
this pull request
Apr 16, 2026
Re-land of #10933, now guarded by the tests in #11266. ## The original problem (#10324) When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. ## The fix Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. ## Why this one doesn't regress like #10933 did The original land (#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's #10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. ## Validation Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the #10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes #10324
teknium1
added a commit
that referenced
this pull request
Apr 17, 2026
Re-land of #10933, now guarded by the tests in #11266. When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. The original land (#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's #10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the #10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes #10324
teknium1
added a commit
that referenced
this pull request
Apr 17, 2026
… (#11277) Re-land of #10933, now guarded by the tests in #11266. When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. The original land (#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's #10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the #10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes #10324
Merged
23 tasks
2 tasks
ulasbilgen
pushed a commit
to ulasbilgen/hermes-adhd-agent
that referenced
this pull request
May 1, 2026
…esearch#10324) (NousResearch#10933) When a custom provider drops a connection mid-stream, the TCP socket can enter CLOSE-WAIT and the httpx read timeout may never fire — epoll_wait blocks indefinitely because no data or error signal arrives. The agent hangs until manually killed. The existing defenses (httpx read timeout, stale stream detector, _force_close_tcp_sockets) are all time-based and work correctly once triggered, but they rely on the socket layer reporting the dead connection. Without TCP keepalives, the kernel has no reason to probe a silent connection. Fix: inject SO_KEEPALIVE + TCP_KEEPIDLE/KEEPINTVL/KEEPCNT into the httpx transport via socket_options. The kernel probes idle connections after 30s, retries every 10s, gives up after 3 failures — dead peer detected within ~60s instead of hanging forever. Platform-aware: uses TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS. Falls back silently if socket options aren't available (Windows, etc.). Closes NousResearch#10324
ulasbilgen
pushed a commit
to ulasbilgen/hermes-adhd-agent
that referenced
this pull request
May 1, 2026
…search#10933) (NousResearch#11266) Two new tests in tests/run_agent/ that pin the user-visible invariant behind AlexKucera's Discord report (2026-04-16): no matter how a future keepalive / transport fix for NousResearch#10324 plumbs sockets in, sequential chats on the same AIAgent instance must all succeed. test_create_openai_client_reuse.py (no network, runs in CI): - test_second_create_does_not_wrap_closed_transport_from_first back-to-back _create_openai_client calls must not hand the same http_client (after an SDK close) to the second construction - test_replace_primary_openai_client_survives_repeated_rebuilds three sequential rebuilds via the real _replace_primary_openai_client entrypoint must each install a live client test_sequential_chats_live.py (opt-in, HERMES_LIVE_TESTS=1): - test_three_sequential_chats_across_client_rebuild real OpenRouter round trips, with an explicit _replace_primary_openai_client call between turns 2 and 3. Error-sentinel detector treats 'API call failed after 3 retries' replies as failures instead of letting them pass the naive truthy check (which is how a first draft of this test missed the bug it was meant to catch). Validation: clean main (post-revert, defensive copy present) -> all 4 tests PASS broken NousResearch#10933 state (keepalive injection, no defensive copy) -> all 4 tests FAIL with precise messages pointing at NousResearch#10933 Companion to taeuk178's test_create_openai_client_kwargs_isolation.py, which pins the syntactic 'don't mutate input dict' half of the same contract. Together they catch both the specific mechanism of NousResearch#10933 and any other reimplementation that breaks the sequential-call invariant.
ulasbilgen
pushed a commit
to ulasbilgen/hermes-adhd-agent
that referenced
this pull request
May 1, 2026
…esearch#10324) (NousResearch#11277) Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266. When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's NousResearch#10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the NousResearch#10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes NousResearch#10324
aj-nt
pushed a commit
to aj-nt/hermes-agent
that referenced
this pull request
May 1, 2026
…esearch#10324) (NousResearch#10933) When a custom provider drops a connection mid-stream, the TCP socket can enter CLOSE-WAIT and the httpx read timeout may never fire — epoll_wait blocks indefinitely because no data or error signal arrives. The agent hangs until manually killed. The existing defenses (httpx read timeout, stale stream detector, _force_close_tcp_sockets) are all time-based and work correctly once triggered, but they rely on the socket layer reporting the dead connection. Without TCP keepalives, the kernel has no reason to probe a silent connection. Fix: inject SO_KEEPALIVE + TCP_KEEPIDLE/KEEPINTVL/KEEPCNT into the httpx transport via socket_options. The kernel probes idle connections after 30s, retries every 10s, gives up after 3 failures — dead peer detected within ~60s instead of hanging forever. Platform-aware: uses TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS. Falls back silently if socket options aren't available (Windows, etc.). Closes NousResearch#10324
aj-nt
pushed a commit
to aj-nt/hermes-agent
that referenced
this pull request
May 1, 2026
…search#10933) (NousResearch#11266) Two new tests in tests/run_agent/ that pin the user-visible invariant behind AlexKucera's Discord report (2026-04-16): no matter how a future keepalive / transport fix for NousResearch#10324 plumbs sockets in, sequential chats on the same AIAgent instance must all succeed. test_create_openai_client_reuse.py (no network, runs in CI): - test_second_create_does_not_wrap_closed_transport_from_first back-to-back _create_openai_client calls must not hand the same http_client (after an SDK close) to the second construction - test_replace_primary_openai_client_survives_repeated_rebuilds three sequential rebuilds via the real _replace_primary_openai_client entrypoint must each install a live client test_sequential_chats_live.py (opt-in, HERMES_LIVE_TESTS=1): - test_three_sequential_chats_across_client_rebuild real OpenRouter round trips, with an explicit _replace_primary_openai_client call between turns 2 and 3. Error-sentinel detector treats 'API call failed after 3 retries' replies as failures instead of letting them pass the naive truthy check (which is how a first draft of this test missed the bug it was meant to catch). Validation: clean main (post-revert, defensive copy present) -> all 4 tests PASS broken NousResearch#10933 state (keepalive injection, no defensive copy) -> all 4 tests FAIL with precise messages pointing at NousResearch#10933 Companion to taeuk178's test_create_openai_client_kwargs_isolation.py, which pins the syntactic 'don't mutate input dict' half of the same contract. Together they catch both the specific mechanism of NousResearch#10933 and any other reimplementation that breaks the sequential-call invariant.
aj-nt
pushed a commit
to aj-nt/hermes-agent
that referenced
this pull request
May 1, 2026
…esearch#10324) (NousResearch#11277) Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266. When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's NousResearch#10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the NousResearch#10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes NousResearch#10324
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
…esearch#10324) (NousResearch#10933) When a custom provider drops a connection mid-stream, the TCP socket can enter CLOSE-WAIT and the httpx read timeout may never fire — epoll_wait blocks indefinitely because no data or error signal arrives. The agent hangs until manually killed. The existing defenses (httpx read timeout, stale stream detector, _force_close_tcp_sockets) are all time-based and work correctly once triggered, but they rely on the socket layer reporting the dead connection. Without TCP keepalives, the kernel has no reason to probe a silent connection. Fix: inject SO_KEEPALIVE + TCP_KEEPIDLE/KEEPINTVL/KEEPCNT into the httpx transport via socket_options. The kernel probes idle connections after 30s, retries every 10s, gives up after 3 failures — dead peer detected within ~60s instead of hanging forever. Platform-aware: uses TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS. Falls back silently if socket options aren't available (Windows, etc.). Closes NousResearch#10324
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
…search#10933) (NousResearch#11266) Two new tests in tests/run_agent/ that pin the user-visible invariant behind AlexKucera's Discord report (2026-04-16): no matter how a future keepalive / transport fix for NousResearch#10324 plumbs sockets in, sequential chats on the same AIAgent instance must all succeed. test_create_openai_client_reuse.py (no network, runs in CI): - test_second_create_does_not_wrap_closed_transport_from_first back-to-back _create_openai_client calls must not hand the same http_client (after an SDK close) to the second construction - test_replace_primary_openai_client_survives_repeated_rebuilds three sequential rebuilds via the real _replace_primary_openai_client entrypoint must each install a live client test_sequential_chats_live.py (opt-in, HERMES_LIVE_TESTS=1): - test_three_sequential_chats_across_client_rebuild real OpenRouter round trips, with an explicit _replace_primary_openai_client call between turns 2 and 3. Error-sentinel detector treats 'API call failed after 3 retries' replies as failures instead of letting them pass the naive truthy check (which is how a first draft of this test missed the bug it was meant to catch). Validation: clean main (post-revert, defensive copy present) -> all 4 tests PASS broken NousResearch#10933 state (keepalive injection, no defensive copy) -> all 4 tests FAIL with precise messages pointing at NousResearch#10933 Companion to taeuk178's test_create_openai_client_kwargs_isolation.py, which pins the syntactic 'don't mutate input dict' half of the same contract. Together they catch both the specific mechanism of NousResearch#10933 and any other reimplementation that breaks the sequential-call invariant.
02356abc
pushed a commit
to 02356abc/hermes-agent
that referenced
this pull request
May 14, 2026
…esearch#10324) (NousResearch#11277) Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266. When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's NousResearch#10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the NousResearch#10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes NousResearch#10324
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…esearch#10324) (NousResearch#10933) When a custom provider drops a connection mid-stream, the TCP socket can enter CLOSE-WAIT and the httpx read timeout may never fire — epoll_wait blocks indefinitely because no data or error signal arrives. The agent hangs until manually killed. The existing defenses (httpx read timeout, stale stream detector, _force_close_tcp_sockets) are all time-based and work correctly once triggered, but they rely on the socket layer reporting the dead connection. Without TCP keepalives, the kernel has no reason to probe a silent connection. Fix: inject SO_KEEPALIVE + TCP_KEEPIDLE/KEEPINTVL/KEEPCNT into the httpx transport via socket_options. The kernel probes idle connections after 30s, retries every 10s, gives up after 3 failures — dead peer detected within ~60s instead of hanging forever. Platform-aware: uses TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS. Falls back silently if socket options aren't available (Windows, etc.). Closes NousResearch#10324
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…search#10933) (NousResearch#11266) Two new tests in tests/run_agent/ that pin the user-visible invariant behind AlexKucera's Discord report (2026-04-16): no matter how a future keepalive / transport fix for NousResearch#10324 plumbs sockets in, sequential chats on the same AIAgent instance must all succeed. test_create_openai_client_reuse.py (no network, runs in CI): - test_second_create_does_not_wrap_closed_transport_from_first back-to-back _create_openai_client calls must not hand the same http_client (after an SDK close) to the second construction - test_replace_primary_openai_client_survives_repeated_rebuilds three sequential rebuilds via the real _replace_primary_openai_client entrypoint must each install a live client test_sequential_chats_live.py (opt-in, HERMES_LIVE_TESTS=1): - test_three_sequential_chats_across_client_rebuild real OpenRouter round trips, with an explicit _replace_primary_openai_client call between turns 2 and 3. Error-sentinel detector treats 'API call failed after 3 retries' replies as failures instead of letting them pass the naive truthy check (which is how a first draft of this test missed the bug it was meant to catch). Validation: clean main (post-revert, defensive copy present) -> all 4 tests PASS broken NousResearch#10933 state (keepalive injection, no defensive copy) -> all 4 tests FAIL with precise messages pointing at NousResearch#10933 Companion to taeuk178's test_create_openai_client_kwargs_isolation.py, which pins the syntactic 'don't mutate input dict' half of the same contract. Together they catch both the specific mechanism of NousResearch#10933 and any other reimplementation that breaks the sequential-call invariant.
gweeteve
pushed a commit
to gweeteve/hermes-agent
that referenced
this pull request
Jun 2, 2026
…esearch#10324) (NousResearch#11277) Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266. When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's NousResearch#10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the NousResearch#10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes NousResearch#10324
Egavasyug
pushed a commit
to Egavasyug/hermes-agent
that referenced
this pull request
Jun 10, 2026
…esearch#10324) (NousResearch#10933) When a custom provider drops a connection mid-stream, the TCP socket can enter CLOSE-WAIT and the httpx read timeout may never fire — epoll_wait blocks indefinitely because no data or error signal arrives. The agent hangs until manually killed. The existing defenses (httpx read timeout, stale stream detector, _force_close_tcp_sockets) are all time-based and work correctly once triggered, but they rely on the socket layer reporting the dead connection. Without TCP keepalives, the kernel has no reason to probe a silent connection. Fix: inject SO_KEEPALIVE + TCP_KEEPIDLE/KEEPINTVL/KEEPCNT into the httpx transport via socket_options. The kernel probes idle connections after 30s, retries every 10s, gives up after 3 failures — dead peer detected within ~60s instead of hanging forever. Platform-aware: uses TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS. Falls back silently if socket options aren't available (Windows, etc.). Closes NousResearch#10324
Egavasyug
pushed a commit
to Egavasyug/hermes-agent
that referenced
this pull request
Jun 10, 2026
…search#10933) (NousResearch#11266) Two new tests in tests/run_agent/ that pin the user-visible invariant behind AlexKucera's Discord report (2026-04-16): no matter how a future keepalive / transport fix for NousResearch#10324 plumbs sockets in, sequential chats on the same AIAgent instance must all succeed. test_create_openai_client_reuse.py (no network, runs in CI): - test_second_create_does_not_wrap_closed_transport_from_first back-to-back _create_openai_client calls must not hand the same http_client (after an SDK close) to the second construction - test_replace_primary_openai_client_survives_repeated_rebuilds three sequential rebuilds via the real _replace_primary_openai_client entrypoint must each install a live client test_sequential_chats_live.py (opt-in, HERMES_LIVE_TESTS=1): - test_three_sequential_chats_across_client_rebuild real OpenRouter round trips, with an explicit _replace_primary_openai_client call between turns 2 and 3. Error-sentinel detector treats 'API call failed after 3 retries' replies as failures instead of letting them pass the naive truthy check (which is how a first draft of this test missed the bug it was meant to catch). Validation: clean main (post-revert, defensive copy present) -> all 4 tests PASS broken NousResearch#10933 state (keepalive injection, no defensive copy) -> all 4 tests FAIL with precise messages pointing at NousResearch#10933 Companion to taeuk178's test_create_openai_client_kwargs_isolation.py, which pins the syntactic 'don't mutate input dict' half of the same contract. Together they catch both the specific mechanism of NousResearch#10933 and any other reimplementation that breaks the sequential-call invariant.
Egavasyug
pushed a commit
to Egavasyug/hermes-agent
that referenced
this pull request
Jun 10, 2026
…esearch#10324) (NousResearch#11277) Re-land of NousResearch#10933, now guarded by the tests in NousResearch#11266. When a provider drops a TCP connection mid-stream, the socket can enter CLOSE-WAIT and ''epoll_wait'' may never fire — no data or error signal arrives, so the httpx read timeout never triggers and the agent hangs indefinitely. The other defenses (''_force_close_tcp_sockets'', stale stream detector) all ride on the socket layer reporting the dead connection, which it never does without probes. Inject ''SO_KEEPALIVE'' + ''TCP_KEEPIDLE''/''KEEPINTVL''/''KEEPCNT'' into the httpx transport. Kernel probes after 30s idle, retries every 10s, gives up after 3 → dead peer detected within ~60s instead of hanging forever. Platform-aware: ''TCP_KEEPIDLE'' on Linux, ''TCP_KEEPALIVE'' on macOS. Silent no-op on Windows or anywhere the socket options aren't available. The original land (NousResearch#10933) mutated ''client_kwargs'' in place when it injected the ''httpx.Client''. Since callers pass ''self._client_kwargs'' by reference, the injected client leaked into the instance state. After the first request, the OpenAI SDK closed its ''http_client'' — including the injected one. The next ''_create_openai_client'' call re-read the now-closed ''httpx.Client'' from ''self._client_kwargs'' and every subsequent chat raised ''APIConnectionError'' with cause ''RuntimeError: Cannot send a request, as the client has been closed'' (AlexKucera's Discord report, 2026-04-16). The defensive ''client_kwargs = dict(client_kwargs)'' copy already on main (taeuk178's NousResearch#10978) means this injection only lands in the per-call local copy. Each ''_create_openai_client'' invocation gets its OWN fresh ''httpx.Client'' whose lifetime is tied to the paired ''OpenAI'' client. When that ''OpenAI'' client is closed (rebuild, teardown, credential rotation), its ''httpx.Client'' closes with it and the next call constructs a fresh one — no stale closed transport can be reused. Full 4-test matrix all green (unit + live with real OpenRouter round trips, HERMES_LIVE_TESTS=1): tests/run_agent/test_create_openai_client_kwargs_isolation.py PASS tests/run_agent/test_create_openai_client_reuse.py PASS (2) tests/run_agent/test_sequential_chats_live.py PASS Socket options verified on the live httpx transport: _socket_options: [(1, 9, 1), (6, 4, 30), (6, 5, 10), (6, 6, 3)] = (SO_KEEPALIVE=1, TCP_KEEPIDLE=30s, TCP_KEEPINTVL=10s, TCP_KEEPCNT=3) Sequential-chat reproduction of the NousResearch#10933 failure was explicitly run against this patch — the defensive copy on main prevents the closed transport from leaking back into ''self._client_kwargs'', so every rebuild constructs a fresh transport. Closes NousResearch#10324
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #10324 — When a custom provider drops a connection mid-stream, the TCP socket enters CLOSE-WAIT and
epoll_waitblocks indefinitely. The agent hangs until manually killed.Why Existing Defenses Don't Catch This
The codebase already has:
_force_close_tcp_sockets(), butsock.shutdown(SHUT_RDWR)on CLOSE-WAIT may not unblock a concurrent blockedread()on some kernels._cleanup_dead_connections()— only runs between requests, not during streaming.All three are time-based and work once triggered, but they depend on the socket layer detecting the dead connection. Without keepalives, a silent CLOSE-WAIT socket generates no signal.
Fix
Inject
SO_KEEPALIVE+ keepalive parameters into the httpx transport:SO_KEEPALIVETCP_KEEPIDLETCP_KEEPINTVLTCP_KEEPCNTDead peer detected within ~60s instead of hanging forever. Platform-aware:
TCP_KEEPIDLEon Linux,TCP_KEEPALIVEon macOS. Falls back silently on unsupported platforms.Verification
Confirmed on a real socket that keepalive parameters are applied:
23 lines added. 75 provider parity tests pass.