fix(run_agent): unfreeze first tool call after idle on macOS (#28834) by xxxigm · Pull Request #29004 · NousResearch/hermes-agent

xxxigm · 2026-05-20T00:51:45Z

What does this PR do?

Fixes #28834 — first tool call after a 2-3 minute idle pause freezes at [Calling tool: ...] until the user kicks the loop (e.g. with /goal).

Root cause is in AIAgent._build_keepalive_http_client (run_agent.py), which configures the TCP keepalive socket options applied to every provider connection. Two adjacent bugs combine to produce the freeze:

- The Linux branch sets TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT together → dead peer detected in ~60 s. The macOS branch only sets TCP_KEEPALIVE (the idle knob's macOS name) and falls through, leaving KEEPINTVL and KEEPCNT at kernel defaults of 75 s × 8 ≈ 10 minutes. After a 2-3 min idle, the provider socket is silently dropped by an intermediate NAT/firewall, but macOS doesn't notice for nearly 10 more minutes.
- Even with the keepalive fix, there's still a narrow window where httpx's keepalive pool hands out a zombie connection to the next request before the keepalive timer has had a chance to mark it dead. Without a connection-level retry, that request hangs or errors with no automatic recovery.
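The two detection budgets quoted above can be checked with a little arithmetic. This is a rough sketch that assumes the usual approximation (idle time plus probe interval times probe count); `dead_peer_budget` is a hypothetical helper, and the 30 s idle value is the one this PR's tests pin.

```python
def dead_peer_budget(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate seconds from the last traffic until the kernel gives up on the peer."""
    return idle_s + interval_s * probes


# Linux branch: all three knobs set explicitly (30 / 10 / 3).
linux_budget = dead_peer_budget(idle_s=30, interval_s=10, probes=3)

# macOS branch before the fix: idle knob set, interval and count left at
# the kernel defaults quoted in this PR (75 s x 8).
macos_before = dead_peer_budget(idle_s=30, interval_s=75, probes=8)

print(linux_budget)   # 60
print(macos_before)   # 630
```

630 s is about 10.5 minutes, which matches the "nearly 10 more minutes" reported after a 2-3 minute idle.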

The fix is two small changes inside _build_keepalive_http_client, each in its own commit:

- Split TCP_KEEPINTVL / TCP_KEEPCNT out of the TCP_KEEPIDLE branch and gate them on their own hasattr checks; both are exposed on macOS in Python ≥ 3.10. macOS now matches Linux's ~60 s detection budget.
- Pass retries=1 to httpx.HTTPTransport so a stale-pool connection that beats the keepalive timer triggers a single transparent re-dial. httpx only retries connection-establishment failures, so this can't double-submit a half-sent request.
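For readers who want to see the shape of the first change, here is a minimal sketch of independently gated keepalive options. It assumes the 30 / 10 / 3 values quoted in this PR; the function name `build_keepalive_sock_opts` and its `sock_mod` parameter are illustrative and are not the names used in run_agent.py.

```python
import socket


def build_keepalive_sock_opts(sock_mod=socket):
    """Return (level, option, value) tuples for whichever knobs the platform exposes."""
    opts = [(sock_mod.SOL_SOCKET, sock_mod.SO_KEEPALIVE, 1)]

    # Idle time before the first probe: TCP_KEEPIDLE on Linux, TCP_KEEPALIVE on macOS.
    if hasattr(sock_mod, "TCP_KEEPIDLE"):
        opts.append((sock_mod.IPPROTO_TCP, sock_mod.TCP_KEEPIDLE, 30))
    elif hasattr(sock_mod, "TCP_KEEPALIVE"):
        opts.append((sock_mod.IPPROTO_TCP, sock_mod.TCP_KEEPALIVE, 30))

    # Probe interval and probe count are gated on their own checks, so they
    # no longer depend on which idle constant was found.
    if hasattr(sock_mod, "TCP_KEEPINTVL"):
        opts.append((sock_mod.IPPROTO_TCP, sock_mod.TCP_KEEPINTVL, 10))
    if hasattr(sock_mod, "TCP_KEEPCNT"):
        opts.append((sock_mod.IPPROTO_TCP, sock_mod.TCP_KEEPCNT, 3))
    return opts


for level, option, value in build_keepalive_sock_opts():
    print(level, option, value)
```

Taking the socket module as a parameter is only a convenience for this sketch: it lets a macOS-shaped stand-in be passed in on any host.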

Related Issue

Fixes #28834.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

run_agent.py (+27/-3) — _build_keepalive_http_client:
- Set TCP_KEEPINTVL=10 and TCP_KEEPCNT=3 on both the Linux and macOS branches via independent hasattr checks (was Linux-only).
- Pass retries=1 to httpx.HTTPTransport so connection-establishment failures (stale pool connections) re-dial transparently.
tests/run_agent/test_keepalive_socket_options.py (+179, new) — three test classes covering the new contract:
- TestKeepaliveSharedKnobs — 4 cases running against the real host's socket module: SO_KEEPALIVE on, TCP_KEEPINTVL=10, TCP_KEEPCNT=3, and a 30 s idle warm-up under whichever of TCP_KEEPIDLE / TCP_KEEPALIVE the platform exposes.
- TestMacOSKeepaliveParity — 1 case stubbing sys.modules['socket'] with a macOS-flavored facade (no TCP_KEEPIDLE) so the test exercises the macOS branch even on Linux CI runners. Pins all three values (30 / 10 / 3) to lock the budget.
- TestStalePoolRetry — 1 case asserting the constructed httpx.HTTPTransport carries retries=1 via the underlying httpcore pool's _retries attribute.
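The stubbing technique that TestMacOSKeepaliveParity relies on can be sketched as follows. This is not the test from the PR: `pick_idle_option` is a hypothetical stand-in for the code under test, and the numeric constants on the facade are placeholders. The point is that a function which imports `socket` at call time sees whatever object is registered in `sys.modules`.

```python
import sys
import types
from unittest import mock


def pick_idle_option():
    """Hypothetical helper: resolves socket at call time so a stub is visible."""
    import socket

    if hasattr(socket, "TCP_KEEPIDLE"):
        return "TCP_KEEPIDLE"
    if hasattr(socket, "TCP_KEEPALIVE"):
        return "TCP_KEEPALIVE"
    return None


# macOS-flavored facade: TCP_KEEPALIVE, TCP_KEEPINTVL and TCP_KEEPCNT are
# present, TCP_KEEPIDLE is deliberately missing. Values are placeholders.
macos_socket = types.SimpleNamespace(
    SOL_SOCKET=0xFFFF,
    SO_KEEPALIVE=0x0008,
    IPPROTO_TCP=6,
    TCP_KEEPALIVE=0x10,
    TCP_KEEPINTVL=0x101,
    TCP_KEEPCNT=0x102,
)

# patch.dict restores the real module when the block exits.
with mock.patch.dict(sys.modules, {"socket": macos_socket}):
    chosen = pick_idle_option()

print(chosen)  # TCP_KEEPALIVE
```

Because the real module is restored on exit, the same process can still run host-level checks afterwards.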

No other production files touched. No config schema changes, no new env vars, no public-API surface change.

How to Test

Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"

Run the new tests on their own:

scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py -v

Expected: 6 passed.

Run the wider OpenAI-client transport suite to confirm no cross-file regressions:

scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py \
  tests/run_agent/test_create_openai_client_reuse.py \
  tests/run_agent/test_create_openai_client_proxy_env.py \
  tests/run_agent/test_create_openai_client_kwargs_isolation.py \
  tests/run_agent/test_async_httpx_del_neuter.py \
  tests/run_agent/test_sequential_chats_live.py

Expected: 27 passed, 1 skipped.

(Optional, on macOS only — reproduces the original issue) Open hermes in interactive mode, run any tool call, idle 3 minutes, then send a request that requires another tool call. Before the fix: hangs at [Calling tool: ...]. After the fix: completes within the usual latency.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(run_agent): ... × 2, test(run_agent): ... × 1)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix (no unrelated commits)
I've run scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py and all tests pass
I've added tests for my changes
I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — N/A (no public-API change; inline comments updated to call out the macOS gap and #28834)
I've updated cli-config.yaml.example if I added/changed config keys — N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — macOS branch fixed to match Linux; Windows path is unchanged (the helper short-circuits via try/except, and Windows has no TCP_KEEPINTVL); fix is the documented behaviour on both supported platforms
I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py -v
4 workers [6 items]
============================== 6 passed in 1.59s ===============================

$ scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py \
    tests/run_agent/test_create_openai_client_reuse.py \
    tests/run_agent/test_create_openai_client_proxy_env.py \
    tests/run_agent/test_create_openai_client_kwargs_isolation.py \
    tests/run_agent/test_async_httpx_del_neuter.py \
    tests/run_agent/test_sequential_chats_live.py
4 workers [28 items]
======================== 27 passed, 1 skipped in 3.94s =========================

The keepalive socket options in ``_build_keepalive_http_client`` set ``TCP_KEEPINTVL`` / ``TCP_KEEPCNT`` only on the Linux branch (``hasattr(socket, "TCP_KEEPIDLE")``). On macOS the constant is spelled ``TCP_KEEPALIVE``, so the macOS branch fell through with just the idle knob set and inherited the kernel defaults for the other two: ``KEEPINTVL=75 s`` × ``KEEPCNT=8`` ≈ **10 minutes** to detect a dead peer, versus Linux's ~60 s.

That gap is the macOS side of NousResearch#28834. After a 2-3 minute idle pause the intermediate NAT / firewall silently drops the provider socket, the next ``chat.completions.create`` reuses the now-zombie connection from httpx's keepalive pool, and the agent hangs at "[Calling tool: …]" until macOS finally notices ~10 min later.

Split the probe-interval / retry-count appends out of the ``TCP_KEEPIDLE`` branch and gate them on their own ``hasattr`` checks. ``socket.TCP_KEEPINTVL`` / ``socket.TCP_KEEPCNT`` are exposed on macOS in Python ≥ 3.10 (verified locally on darwin / CPython 3.13), so the macOS path now lands on the same 60 s detection budget as Linux. No behaviour change on Linux: the same three options end up in ``_sock_opts``, just appended in two stages instead of one.

TCP keepalive — even with the macOS parity fix in the previous commit — closes the dead-peer detection budget to ~60 s. But the window between "peer dropped the socket" and "kernel finally notices" is still long enough for httpx's keepalive pool to hand out a zombie connection to the very next request, which then fails the connection establishment / first write. ``httpx.HTTPTransport(retries=N)`` is the documented escape hatch for exactly this case: it retries *connection-level* failures (``httpx.ConnectError`` and friends) and does **not** retry mid-stream errors, so a half-sent ``chat.completions.create`` won't be resubmitted. Setting ``retries=1`` lets a single transparent re-dial turn the post-idle "[Calling tool: …]" freeze in NousResearch#28834 into a sub-second hiccup the user never sees. Anything higher would risk burning budget on a genuinely-down provider, so we stay conservative.
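The retry semantics described in this commit message can be modelled in a few lines. This is a toy model under stated assumptions, not httpx or httpcore code: `ConnectFailure`, `send_with_connect_retry` and the callables passed to it are all hypothetical names. It only illustrates the contract that connection establishment may be retried while a request that has started sending is never resubmitted.

```python
class ConnectFailure(Exception):
    """Stands in for a connection-establishment error."""


def send_with_connect_retry(connect, send, retries=1):
    """Retry only the connect step; errors raised by send() propagate untouched."""
    attempts = 0
    while True:
        try:
            conn = connect()
        except ConnectFailure:
            if attempts >= retries:
                raise
            attempts += 1
            continue
        # Once we hold a connection the request goes out exactly once.
        return send(conn)


calls = {"connect": 0, "send": 0}


def flaky_connect():
    calls["connect"] += 1
    if calls["connect"] == 1:
        raise ConnectFailure("first dial fails")
    return "conn"


def send_once(conn):
    calls["send"] += 1
    return "ok"


result = send_with_connect_retry(flaky_connect, send_once, retries=1)
print(result, calls)  # ok {'connect': 2, 'send': 1}
```

With `retries=1` a second consecutive connect failure is raised to the caller, which mirrors the conservative budget argued for above.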

…rch#28834)

Six tests across three classes covering both halves of the fix:

* ``TestKeepaliveSharedKnobs``: runs against the real host's socket module and asserts the four invariants every supported platform must satisfy: ``SO_KEEPALIVE`` on, ``TCP_KEEPINTVL=10``, ``TCP_KEEPCNT=3``, and either ``TCP_KEEPIDLE`` or ``TCP_KEEPALIVE`` carrying the 30 s warm-up.
* ``TestMacOSKeepaliveParity``: stubs ``sys.modules['socket']`` with a facade matching the macOS attribute surface (``TCP_KEEPALIVE`` + ``TCP_KEEPINTVL`` + ``TCP_KEEPCNT``, no ``TCP_KEEPIDLE``) so the test exercises the macOS branch even on Linux CI runners. Pins the documented 30 / 10 / 3 budget so a drive-by tuning change can't silently re-open the 10-minute dead-peer detection window on Darwin.
* ``TestStalePoolRetry``: asserts the constructed ``httpx.HTTPTransport`` carries ``retries=1`` so a zombie connection from the keepalive pool gets transparently re-dialled instead of hanging the next ``chat.completions`` call.

Tests build a real ``httpx.Client`` and read back the socket options + transport retries. No production code paths are mocked, so any future implementation that achieves the same observable contract through a different code path still passes.

qdaszx · 2026-05-20T02:30:45Z

I started looking into #28834 independently from the issue report before noticing this PR. My initial triage direction was around the CLI/agent-loop side: pending steer or activity-summary state after idle, and whether the first tool-call transition was getting stuck after the UI printed the tool-call marker.

After reading this PR, I think the transport-layer explanation is a better fit for the idle-specific pattern than my original hypothesis. The “first request/tool-call after a few minutes idle” symptom makes a stale pooled provider connection plausible, especially for OpenAI-compatible/custom provider paths.

I did a small verification pass from that angle:

On macOS, TCP_KEEPIDLE is absent while TCP_KEEPALIVE, TCP_KEEPINTVL, and TCP_KEEPCNT are present.
Building the PR keepalive client on macOS produced the expected socket options:
- SO_KEEPALIVE = 1
- TCP_KEEPALIVE = 30
- TCP_KEEPINTVL = 10
- TCP_KEEPCNT = 3
The targeted test file passes locally:
- python -m pytest tests/run_agent/test_keepalive_socket_options.py -q -o 'addopts='
- 6 passed
ruff passes on the changed files.
I also checked the HTTPTransport(retries=1) concern. In httpx/httpcore this flows into the connection-pool retry count and retries connect-establishment failures such as ConnectError/ConnectTimeout around connect_tcp; it does not appear to retry arbitrary mid-stream/read/write failures after a request has already been sent.
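The first check in the list above is easy to repeat on any host. A minimal sketch using only the standard library; the output differs by platform, so no expected values are shown.

```python
import socket

NAMES = ("TCP_KEEPIDLE", "TCP_KEEPALIVE", "TCP_KEEPINTVL", "TCP_KEEPCNT")

# Map each keepalive constant to whether this Python build exposes it.
exposed = {name: hasattr(socket, name) for name in NAMES}

for name, present in exposed.items():
    print(f"{name:14} {'present' if present else 'absent'}")
```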

So from my side, the macOS socket-option part looks sound, and the retry change looks reasonable for stale connection recovery.

The only thing I would still separate from the code review is the red “Scan PR for critical supply chain risks” check. The visible PR diff only touches run_agent.py and tests/run_agent/test_keepalive_socket_options.py; the supply-chain finding may be coming from the workflow’s diff base / stale branch history rather than this actual patch. A rebase/rerun or checking whether the scanner uses a two-dot BASE..HEAD diff may resolve that.

Net: this PR changed my mind from “instrument the CLI/tool-call transition first” to “this transport-layer fix is probably the right first fix for #28834.” I don’t have code changes to request; I’d mainly want the red supply-chain check resolved or explained before merge.

xxxigm added 3 commits May 20, 2026 07:46

xxxigm force-pushed the fix/28834-tcp-keepalive-stale-connection branch from 6281480 to 229076b Compare May 20, 2026 00:53

alt-glitch added the type/bug, comp/agent, and P1 labels May 20, 2026

xxxigm wants to merge 3 commits into NousResearch:main from xxxigm:fix/28834-tcp-keepalive-stale-connection

3 participants
