fix(run_agent): unfreeze first tool call after idle on macOS (#28834)#29004
fix(run_agent): unfreeze first tool call after idle on macOS (#28834)#29004xxxigm wants to merge 3 commits into
Conversation
The keepalive socket options in ``_build_keepalive_http_client`` set ``TCP_KEEPINTVL`` / ``TCP_KEEPCNT`` only on the Linux branch (``hasattr(socket, "TCP_KEEPIDLE")``). On macOS the constant is spelled ``TCP_KEEPALIVE``, so the macOS branch fell through with just the idle knob set and inherited the kernel defaults for the other two — ``KEEPINTVL=75 s`` × ``KEEPCNT=8`` ≈ **10 minutes** to detect a dead peer, versus Linux's ~60 s. That gap is the macOS side of NousResearch#28834. After a 2-3 minute idle pause the intermediate NAT / firewall silently drops the provider socket, the next ``chat.completions.create`` reuses the now-zombie connection from httpx's keepalive pool, and the agent hangs at "[Calling tool: …]" until macOS finally notices ~10 min later. Split the probe-interval / retry-count appends out of the ``TCP_KEEPIDLE`` branch and gate them on their own ``hasattr`` checks. ``socket.TCP_KEEPINTVL`` / ``socket.TCP_KEEPCNT`` are exposed on macOS in Python ≥ 3.10 (verified locally on darwin / CPython 3.13), so the macOS path now lands on the same 60 s detection budget as Linux. No behaviour change on Linux — the same three options end up in ``_sock_opts``, just appended in two stages instead of one.
TCP keepalive — even with the macOS parity fix in the previous commit — closes the dead-peer detection budget to ~60 s. But the window between "peer dropped the socket" and "kernel finally notices" is still long enough for httpx's keepalive pool to hand out a zombie connection to the very next request, which then fails the connection establishment / first write. ``httpx.HTTPTransport(retries=N)`` is the documented escape hatch for exactly this case: it retries *connection-level* failures (``httpx.ConnectError`` and friends) and does **not** retry mid-stream errors, so a half-sent ``chat.completions.create`` won't be resubmitted. Setting ``retries=1`` lets a single transparent re-dial turn the post-idle "[Calling tool: …]" freeze in NousResearch#28834 into a sub-second hiccup the user never sees. Anything higher would risk burning budget on a genuinely-down provider, so we stay conservative.
…rch#28834) Six tests across three classes covering both halves of the fix: * ``TestKeepaliveSharedKnobs`` — runs against the real host's socket module and asserts the four invariants every supported platform must satisfy: ``SO_KEEPALIVE`` on, ``TCP_KEEPINTVL=10``, ``TCP_KEEPCNT=3``, and either ``TCP_KEEPIDLE`` or ``TCP_KEEPALIVE`` carrying the 30 s warm-up. * ``TestMacOSKeepaliveParity`` — stubs ``sys.modules['socket']`` with a facade matching the macOS attribute surface (``TCP_KEEPALIVE`` + ``TCP_KEEPINTVL`` + ``TCP_KEEPCNT``, no ``TCP_KEEPIDLE``) so the test exercises the macOS branch even on Linux CI runners. Pins the documented 30 / 10 / 3 budget so a drive-by tuning change can't silently re-open the 10-minute dead-peer detection window on Darwin. * ``TestStalePoolRetry`` — asserts the constructed ``httpx.HTTPTransport`` carries ``retries=1`` so a zombie connection from the keepalive pool gets transparently re-dialled instead of hanging the next ``chat.completions`` call. Tests build a real ``httpx.Client`` and read back the socket options + transport retries — no production code paths mocked, so any future implementation that achieves the same observable contract through a different code path still passes.
6281480 to
229076b
Compare
|
I started looking into #28834 independently from the issue report before noticing this PR. My initial triage direction was around the CLI/agent-loop side: pending steer or activity-summary state after idle, and whether the first tool-call transition was getting stuck after the UI printed the tool-call marker. After reading this PR, I think the transport-layer explanation is a better fit for the idle-specific pattern than my original hypothesis. The “first request/tool-call after a few minutes idle” symptom makes a stale pooled provider connection plausible, especially for OpenAI-compatible/custom provider paths. I did a small verification pass from that angle:
So from my side, the macOS socket-option part looks sound, and the retry change looks reasonable for stale connection recovery. The only thing I would still separate from the code review is the red “Scan PR for critical supply chain risks” check. The visible PR diff only touches Net: this PR changed my mind from “instrument the CLI/tool-call transition first” to “this transport-layer fix is probably the right first fix for #28834.” I don’t have code changes to request; I’d mainly want the red supply-chain check resolved or explained before merge. |
What does this PR do?
Fixes #28834 — first tool call after a 2-3 minute idle pause freezes at
[Calling tool: ...]until the user kicks the loop (e.g. with/goal).Root cause is in
AIAgent._build_keepalive_http_client(run_agent.py), which configures the TCP keepalive socket options applied to every provider connection. Two adjacent bugs combine to produce the freeze:TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNTtogether → dead peer detected in ~60 s. The macOS branch only setsTCP_KEEPALIVE(the idle knob's macOS name) and falls through, leavingKEEPINTVLandKEEPCNTat kernel defaults of 75 s × 8 ≈ 10 minutes. After a 2-3 min idle, the provider socket is silently dropped by intermediate NAT/firewall but macOS doesn't notice for nearly 10 more minutes.The fix is two small changes inside
_build_keepalive_http_client, each in its own commit:TCP_KEEPINTVL/TCP_KEEPCNTout of theTCP_KEEPIDLEbranch and gate them on their ownhasattrchecks — both are exposed on macOS in Python ≥ 3.10. macOS now matches Linux's ~60 s detection budget.retries=1tohttpx.HTTPTransportso a stale-pool connection that beats the keepalive timer triggers a single transparent re-dial.httpxonly retries connection-establishment failures, so this can't double-submit a half-sent request.Related Issue
Fixes #28834.
Type of Change
Changes Made
run_agent.py(+27/-3) —_build_keepalive_http_client:TCP_KEEPINTVL=10andTCP_KEEPCNT=3on both the Linux and macOS branches via independenthasattrchecks (was Linux-only).retries=1tohttpx.HTTPTransportso connection-establishment failures (stale pool connections) re-dial transparently.tests/run_agent/test_keepalive_socket_options.py(+179, new) — three test classes covering the new contract:TestKeepaliveSharedKnobs— 4 cases running against the real host's socket module:SO_KEEPALIVEon,TCP_KEEPINTVL=10,TCP_KEEPCNT=3, and a 30 s idle warm-up under whichever ofTCP_KEEPIDLE/TCP_KEEPALIVEthe platform exposes.TestMacOSKeepaliveParity— 1 case stubbingsys.modules['socket']with a macOS-flavored facade (noTCP_KEEPIDLE) so the test exercises the macOS branch even on Linux CI runners. Pins all three values (30 / 10 / 3) to lock the budget.TestStalePoolRetry— 1 case asserting the constructedhttpx.HTTPTransportcarriesretries=1via the underlyinghttpcorepool's_retriesattribute.No other production files touched. No config schema changes, no new env vars, no public-API surface change.
How to Test
.venvis set up:python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"hermesin interactive mode, run any tool call, idle 3 minutes, then send a request that requires another tool call. Before the fix: hangs at[Calling tool: ...]. After the fix: completes within the usual latency.Checklist
Code
fix(run_agent): ...× 2,test(run_agent): ...× 1)scripts/run_tests.sh tests/run_agent/test_keepalive_socket_options.pyand all tests passDocumentation & Housekeeping
docs/, docstrings) — N/A (no public-API change; inline comments updated to call out the macOS gap + [Bug]: Agent tool calls freeze mid-execution — output stops at [Calling tool: ...] with no response, /goal unblocks #28834)cli-config.yaml.exampleif I added/changed config keys — N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/Atry/except, and Windows has noTCP_KEEPINTVL); fix is the documented behaviour on both supported platformsScreenshots / Logs