Bug Description
Hermes agent processes hang indefinitely when the custom OpenAI-compatible provider (e.g. LiteLLM proxy) drops a connection mid-response. The process enters CLOSE-WAIT state and never recovers, requiring kill -9.
Steps to Reproduce
- Configure Hermes with a custom provider pointing to a local LiteLLM proxy (http://127.0.0.1:4000/v1)
- Start a conversation that triggers a streaming API call
- Restart or interrupt the LiteLLM proxy while the request is in-flight
- Observe that the Hermes process hangs indefinitely
Expected Behavior
Hermes should detect the dead connection and either:
- Raise a timeout/disconnect error and surface it to the user, or
- Automatically reconnect and retry
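The second option could look roughly like the sketch below. It is a generic retry wrapper, not Hermes code: `call_with_retry`, `request_fn`, and the default exception tuple are hypothetical names chosen for illustration, and a real integration would presumably catch the httpx/OpenAI client exceptions instead of the built-in ones.

```python
import time


def call_with_retry(request_fn, retries=2, backoff_s=0.5,
                    retry_on=(ConnectionError, TimeoutError)):
    """Call request_fn(); on a disconnect-style error, back off and retry.

    Re-raises the last error once the retry budget is exhausted so the
    failure is surfaced to the caller instead of being swallowed.
    """
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return request_fn()
        except retry_on as exc:
            last_exc = exc
            if attempt < retries:
                # Exponential backoff: backoff_s, 2*backoff_s, 4*backoff_s, ...
                time.sleep(backoff_s * (2 ** attempt))
    raise last_exc
```

A wrapper like this only helps once the dead connection is actually reported as an exception; it cannot interrupt a call that never returns.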
Actual Behavior
The process hangs forever in epoll_wait() (visible as ep_poll in strace). The socket enters TCP CLOSE-WAIT — the remote closed its write side, but the httpx async event loop never wakes up because epoll_wait does not fire on a socket that received a FIN while no data was buffered.
ps output during hang:
25000 pts/3 00:03:42 hermes # stuck in ep_poll
65531 pts/0 00:00:36 hermes # stuck in ep_poll
ss -tnp shows both connections in CLOSE-WAIT to port 4000.
Only kill -9 recovers them. Any in-progress session is lost.
Root Cause Analysis
run_agent.py → _create_openai_client() creates an OpenAI(...) client with:
# agent/anthropic_adapter.py line 257
"timeout": Timeout(timeout=900.0, connect=10.0)
The 900s timeout is an httpx read/write inactivity timeout — it only fires when the kernel reports I/O readiness. A CLOSE-WAIT socket that received FIN but has no buffered data does not become readable via epoll until a read is attempted. Since httpx uses async I/O via epoll_wait, the timeout timer never starts, and the process blocks forever.
_cleanup_dead_connections() (line 4245) correctly probes sockets with MSG_PEEK | MSG_DONTWAIT and would detect this — but it only runs at the start of run_conversation() (line 7817), not during an in-flight API call. Once blocked inside epoll_wait, no cleanup code can intervene.
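For reference, a non-blocking peek probe of the kind described above can be sketched as follows. This is an illustration of the MSG_PEEK | MSG_DONTWAIT technique on a raw socket, not the actual body of _cleanup_dead_connections(); `peer_has_closed` is a hypothetical name, and the flags assume a Unix platform.

```python
import socket


def peer_has_closed(sock):
    """Return True if the peer has closed and no unread data is pending.

    MSG_PEEK leaves any data in the receive buffer; MSG_DONTWAIT makes
    the call return immediately instead of blocking.
    """
    try:
        data = sock.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except BlockingIOError:
        return False  # nothing to read yet, connection still open
    except OSError:
        return True   # reset or otherwise broken connection
    # An empty read means the peer performed an orderly shutdown (FIN).
    return data == b""
```

The probe is cheap, but it only helps if something calls it; that is exactly the gap described above, since nothing runs it while the request is in flight.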
Proposed Fix
Enable TCP keepalives on the httpx transport so the kernel sends keepalive probes on idle connections. When the remote is gone, the probes fail and the kernel closes the socket, waking epoll_wait with an error:
import socket

import httpx
from openai import OpenAI

# TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are Linux socket options.
transport = httpx.HTTPTransport(
    socket_options=[
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable keepalive probes
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30),   # start probes after 30s idle
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),  # probe every 10s
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # 3 failed probes = dead
    ]
)
# "..." stands for the existing client arguments (base URL, API key, timeout).
client = OpenAI(..., http_client=httpx.Client(transport=transport))
This gives a ~60s worst-case detection window (30 + 3×10) instead of infinite hang. The existing error_classifier.py disconnect patterns would then fire correctly.
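The detection window follows directly from the three keepalive settings. A small sketch of the arithmetic, using a hypothetical helper name, makes it easy to see how other values would trade off:

```python
def keepalive_detection_window(idle_s, interval_s, probe_count):
    """Worst-case seconds from last activity until the kernel gives up.

    The kernel waits idle_s before the first probe, then sends up to
    probe_count probes spaced interval_s apart before declaring the
    connection dead.
    """
    return idle_s + interval_s * probe_count


# Values from the proposed fix: 30s idle, 10s interval, 3 probes.
proposed_window = keepalive_detection_window(30, 10, 3)
```

With the proposed values this evaluates to 60 seconds; halving the idle time to 15 would bring it down to 45 at the cost of more probe traffic on long-lived idle connections.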
Environment
- OS: Linux (Ubuntu)
- Provider: custom OpenAI-compatible (LiteLLM proxy on 127.0.0.1:4000)
- api_mode: chat_completions
- Hermes commit: 10494b42
- httpx version: see pip show httpx
- Python: 3.x
Workaround
Kill hung processes manually with kill -9 <pid>.
The existing _cleanup_dead_connections() pre-turn check prevents the next message from hanging on a stale socket, but cannot rescue a process already blocked in epoll_wait.