Skip to content

fix(run_agent): unfreeze first tool call after idle on macOS (#28834)#29004

Open
xxxigm wants to merge 3 commits into
NousResearch:mainfrom
xxxigm:fix/28834-tcp-keepalive-stale-connection
Open

fix(run_agent): unfreeze first tool call after idle on macOS (#28834)#29004
xxxigm wants to merge 3 commits into
NousResearch:mainfrom
xxxigm:fix/28834-tcp-keepalive-stale-connection

Conversation

@xxxigm

@xxxigm xxxigm commented May 20, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes #28834 — first tool call after a 2-3 minute idle pause freezes at [Calling tool: ...] until the user kicks the loop (e.g. with /goal).

Root cause is in AIAgent._build_keepalive_http_client (run_agent.py), which configures the TCP keepalive socket options applied to every provider connection. Two adjacent bugs combine to produce the freeze:

  1. The Linux branch sets TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT together → dead peer detected in ~60 s. The macOS branch only sets TCP_KEEPALIVE (the idle knob's macOS name) and falls through, leaving KEEPINTVL and KEEPCNT at kernel defaults of 75 s × 8 ≈ 10 minutes. After a 2-3 min idle, the provider socket is silently dropped by intermediate NAT/firewall but macOS doesn't notice for nearly 10 more minutes.
  2. Even with the keepalive fix, there's still a narrow window where httpx's keepalive pool hands out a zombie connection to the next request before the keepalive timer has had a chance to mark it dead. Without a connection-level retry, that request hangs / errors with no automatic recovery.

The fix is two small changes inside _build_keepalive_http_client, each in its own commit:

  • Split TCP_KEEPINTVL / TCP_KEEPCNT out of the TCP_KEEPIDLE branch and gate them on their own hasattr checks — both are exposed on macOS in Python ≥ 3.10. macOS now matches Linux's ~60 s detection budget.
  • Pass retries=1 to httpx.HTTPTransport so a stale-pool connection that beats the keepalive timer triggers a single transparent re-dial. httpx only retries connection-establishment failures, so this can't double-submit a half-sent request.

Related Issue

Fixes #28834.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • run_agent.py (+27/-3) — _build_keepalive_http_client:
    • Set TCP_KEEPINTVL=10 and TCP_KEEPCNT=3 on both the Linux and macOS branches via independent hasattr checks (was Linux-only).
    • Pass retries=1 to httpx.HTTPTransport so connection-establishment failures (stale pool connections) re-dial transparently.
  • tests/run_agent/test_keepalive_socket_options.py (+179, new) — three test classes covering the new contract:
    • TestKeepaliveSharedKnobs — 4 cases running against the real host's socket module: SO_KEEPALIVE on, TCP_KEEPINTVL=10, TCP_KEEPCNT=3, and a 30 s idle warm-up under whichever of TCP_KEEPIDLE / TCP_KEEPALIVE the platform exposes.
    • TestMacOSKeepaliveParity — 1 case stubbing sys.modules['socket'] with a macOS-flavored facade (no TCP_KEEPIDLE) so the test exercises the macOS branch even on Linux CI runners. Pins all three values (30 / 10 / 3) to lock the budget.
    • TestStalePoolRetry — 1 case asserting the constructed httpx.HTTPTransport carries retries=1 via the underlying httpcore pool's _retries attribute.

No other production files touched. No config schema changes, no new env vars, no public-API surface change.

How to Test

  1. Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
  2. Run the new tests on their own:
    scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py -v
    
    Expected: 6 passed.
  3. Run the wider OpenAI-client transport suite to confirm no cross-file regressions:
    scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py \
      tests/run_agent/test_create_openai_client_reuse.py \
      tests/run_agent/test_create_openai_client_proxy_env.py \
      tests/run_agent/test_create_openai_client_kwargs_isolation.py \
      tests/run_agent/test_async_httpx_del_neuter.py \
      tests/run_agent/test_sequential_chats_live.py
    
    Expected: 27 passed, 1 skipped.
  4. (Optional, on macOS only — reproduces the original issue) Open hermes in interactive mode, run any tool call, idle 3 minutes, then send a request that requires another tool call. Before the fix: hangs at [Calling tool: ...]. After the fix: completes within the usual latency.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(run_agent): ... × 2, test(run_agent): ... × 1)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A (no public-API change; inline comments updated to call out the macOS gap + [Bug]: Agent tool calls freeze mid-execution — output stops at [Calling tool: ...] with no response, /goal unblocks #28834)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — macOS branch fixed to match Linux; Windows path is unchanged (the helper short-circuits via try/except, and Windows has no TCP_KEEPINTVL); fix is the documented behaviour on both supported platforms
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py -v
4 workers [6 items]
============================== 6 passed in 1.59s ===============================

$ scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.py \
    tests/run_agent/test_create_openai_client_reuse.py \
    tests/run_agent/test_create_openai_client_proxy_env.py \
    tests/run_agent/test_create_openai_client_kwargs_isolation.py \
    tests/run_agent/test_async_httpx_del_neuter.py \
    tests/run_agent/test_sequential_chats_live.py
4 workers [28 items]
======================== 27 passed, 1 skipped in 3.94s =========================

xxxigm added 3 commits May 20, 2026 07:46
The keepalive socket options in ``_build_keepalive_http_client`` set
``TCP_KEEPINTVL`` / ``TCP_KEEPCNT`` only on the Linux branch
(``hasattr(socket, "TCP_KEEPIDLE")``).  On macOS the constant is
spelled ``TCP_KEEPALIVE``, so the macOS branch fell through with
just the idle knob set and inherited the kernel defaults for the
other two — ``KEEPINTVL=75 s`` × ``KEEPCNT=8`` ≈ **10 minutes** to
detect a dead peer, versus Linux's ~60 s.

That gap is the macOS side of NousResearch#28834.  After a 2-3 minute idle pause
the intermediate NAT / firewall silently drops the provider socket,
the next ``chat.completions.create`` reuses the now-zombie connection
from httpx's keepalive pool, and the agent hangs at "[Calling tool:
…]" until macOS finally notices ~10 min later.

Split the probe-interval / retry-count appends out of the
``TCP_KEEPIDLE`` branch and gate them on their own ``hasattr``
checks.  ``socket.TCP_KEEPINTVL`` / ``socket.TCP_KEEPCNT`` are
exposed on macOS in Python ≥ 3.10 (verified locally on
darwin / CPython 3.13), so the macOS path now lands on the same
60 s detection budget as Linux.

No behaviour change on Linux — the same three options end up in
``_sock_opts``, just appended in two stages instead of one.
TCP keepalive — even with the macOS parity fix in the previous
commit — closes the dead-peer detection budget to ~60 s.  But the
window between "peer dropped the socket" and "kernel finally
notices" is still long enough for httpx's keepalive pool to hand
out a zombie connection to the very next request, which then
fails the connection establishment / first write.

``httpx.HTTPTransport(retries=N)`` is the documented escape hatch
for exactly this case: it retries *connection-level* failures
(``httpx.ConnectError`` and friends) and does **not** retry
mid-stream errors, so a half-sent ``chat.completions.create``
won't be resubmitted.

Setting ``retries=1`` lets a single transparent re-dial turn the
post-idle "[Calling tool: …]" freeze in NousResearch#28834 into a sub-second
hiccup the user never sees.  Anything higher would risk burning
budget on a genuinely-down provider, so we stay conservative.
…rch#28834)

Six tests across three classes covering both halves of the fix:

* ``TestKeepaliveSharedKnobs`` — runs against the real host's
  socket module and asserts the four invariants every supported
  platform must satisfy: ``SO_KEEPALIVE`` on, ``TCP_KEEPINTVL=10``,
  ``TCP_KEEPCNT=3``, and either ``TCP_KEEPIDLE`` or ``TCP_KEEPALIVE``
  carrying the 30 s warm-up.

* ``TestMacOSKeepaliveParity`` — stubs ``sys.modules['socket']``
  with a facade matching the macOS attribute surface
  (``TCP_KEEPALIVE`` + ``TCP_KEEPINTVL`` + ``TCP_KEEPCNT``, no
  ``TCP_KEEPIDLE``) so the test exercises the macOS branch even on
  Linux CI runners.  Pins the documented 30 / 10 / 3 budget so a
  drive-by tuning change can't silently re-open the 10-minute
  dead-peer detection window on Darwin.

* ``TestStalePoolRetry`` — asserts the constructed
  ``httpx.HTTPTransport`` carries ``retries=1`` so a zombie
  connection from the keepalive pool gets transparently re-dialled
  instead of hanging the next ``chat.completions`` call.

Tests build a real ``httpx.Client`` and read back the socket
options + transport retries — no production code paths mocked,
so any future implementation that achieves the same observable
contract through a different code path still passes.
@xxxigm xxxigm force-pushed the fix/28834-tcp-keepalive-stale-connection branch from 6281480 to 229076b Compare May 20, 2026 00:53
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround labels May 20, 2026
@qdaszx

qdaszx commented May 20, 2026

Copy link
Copy Markdown

I started looking into #28834 independently from the issue report before noticing this PR. My initial triage direction was around the CLI/agent-loop side: pending steer or activity-summary state after idle, and whether the first tool-call transition was getting stuck after the UI printed the tool-call marker.

After reading this PR, I think the transport-layer explanation is a better fit for the idle-specific pattern than my original hypothesis. The “first request/tool-call after a few minutes idle” symptom makes a stale pooled provider connection plausible, especially for OpenAI-compatible/custom provider paths.

I did a small verification pass from that angle:

  • On macOS, TCP_KEEPIDLE is absent while TCP_KEEPALIVE, TCP_KEEPINTVL, and TCP_KEEPCNT are present.
  • Building the PR keepalive client on macOS produced the expected socket options:
    • SO_KEEPALIVE = 1
    • TCP_KEEPALIVE = 30
    • TCP_KEEPINTVL = 10
    • TCP_KEEPCNT = 3
  • The targeted test file passes locally:
    • python -m pytest tests/run_agent/test_keepalive_socket_options.py -q -o 'addopts='
    • 6 passed
  • ruff passes on the changed files.
  • I also checked the HTTPTransport(retries=1) concern. In httpx/httpcore this flows into the connection-pool retry count and retries connect-establishment failures such as ConnectError/ConnectTimeout around connect_tcp; it does not appear to retry arbitrary mid-stream/read/write failures after a request has already been sent.

So from my side, the macOS socket-option part looks sound, and the retry change looks reasonable for stale connection recovery.

The only thing I would still separate from the code review is the red “Scan PR for critical supply chain risks” check. The visible PR diff only touches run_agent.py and tests/run_agent/test_keepalive_socket_options.py; the supply-chain finding may be coming from the workflow’s diff base / stale branch history rather than this actual patch. A rebase/rerun or checking whether the scanner uses a two-dot BASE..HEAD diff may resolve that.

Net: this PR changed my mind from “instrument the CLI/tool-call transition first” to “this transport-layer fix is probably the right first fix for #28834.” I don’t have code changes to request; I’d mainly want the red supply-chain check resolved or explained before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Agent tool calls freeze mid-execution — output stops at [Calling tool: ...] with no response, /goal unblocks

3 participants