Agent hangs indefinitely on CLOSE-WAIT when custom provider drops connection mid-response #10324

@rfilgueiras

Bug Description

Hermes agent processes hang indefinitely when the custom OpenAI-compatible provider (e.g. a LiteLLM proxy) drops a connection mid-response. The client socket enters the CLOSE-WAIT state and the process never recovers, requiring kill -9.

Steps to Reproduce

  1. Configure Hermes with a custom provider pointing to a local LiteLLM proxy (http://127.0.0.1:4000/v1)
  2. Start a conversation that triggers a streaming API call
  3. Restart or interrupt the LiteLLM proxy while the request is in-flight
  4. Observe that the Hermes process hangs indefinitely

Expected Behavior

Hermes should detect the dead connection and either:

  • Raise a timeout/disconnect error and surface it to the user, or
  • Automatically reconnect and retry
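As a rough illustration of the first option, a wall-clock deadline can be put around the blocking call, independent of any socket-level timeout. This is only a sketch: call_with_deadline and its parameters are hypothetical names, not Hermes code, and it assumes the call can be abandoned in a worker thread.

```python
import concurrent.futures


def call_with_deadline(fn, deadline_s, retries=0):
    """Run fn() with a wall-clock deadline; retry up to `retries` times.

    Hypothetical helper, not Hermes code. A timed-out call is abandoned in
    its worker thread; it is not cancelled.
    """
    for _ in range(retries + 1):
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = executor.submit(fn)
        try:
            return future.result(timeout=deadline_s)
        except concurrent.futures.TimeoutError:
            continue  # deadline hit: fall through to the next attempt
        finally:
            # Do not wait for the stuck worker, or we would hang again here.
            executor.shutdown(wait=False)
    raise TimeoutError(
        f"no response within {deadline_s}s after {retries + 1} attempt(s)"
    )
```

The abandoned worker still holds its socket and can delay interpreter exit, so a wrapper like this only surfaces the error to the user; it does not release the dead connection.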

Actual Behavior

The process hangs forever in epoll_wait() (visible as ep_poll in strace). The socket enters TCP CLOSE-WAIT: the remote closed its write side, but the httpx event loop never wakes up, apparently because no readiness event is acted on for a socket that received a FIN while no data was buffered.

ps output during hang:

25000 pts/3    00:03:42 hermes   # stuck in ep_poll
65531 pts/0    00:00:36 hermes   # stuck in ep_poll

ss -tnp shows both connections in CLOSE-WAIT to port 4000.

Only kill -9 recovers them. Any in-progress session is lost.
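For anyone trying to confirm the same state without ss, the check can be sketched in Python by reading /proc/net/tcp, where state code 08 means CLOSE_WAIT. The function name close_wait_sockets is made up for this example; it only covers IPv4 and lists sockets system-wide rather than per process.

```python
CLOSE_WAIT = "08"  # TCP state code for CLOSE_WAIT in /proc/net/tcp


def close_wait_sockets(proc_net_tcp_text, remote_port):
    """Return (local, remote) hex address pairs in CLOSE_WAIT to remote_port."""
    matches = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # Addresses look like 0100007F:0FA0 (hex IP, then hex port).
        port = int(remote.rsplit(":", 1)[1], 16)
        if state == CLOSE_WAIT and port == remote_port:
            matches.append((local, remote))
    return matches
```

On a Linux host this would be called as close_wait_sockets(open("/proc/net/tcp").read(), 4000).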

Root Cause Analysis

_create_openai_client() in run_agent.py creates an OpenAI(...) client with:

# agent/anthropic_adapter.py line 257
"timeout": Timeout(timeout=900.0, connect=10.0)

The 900s value is an httpx read/write inactivity timeout, yet in this state it never fires. The working theory is that a CLOSE-WAIT socket that received a FIN with no buffered data is not reported as readable via epoll until a read is attempted. Since the client waits in epoll_wait, the timeout timer never starts and the process blocks forever.

_cleanup_dead_connections() (line 4245) correctly probes sockets with MSG_PEEK | MSG_DONTWAIT and would detect this — but it only runs at the start of run_conversation() (line 7817), not during an in-flight API call. Once blocked inside epoll_wait, no cleanup code can intervene.
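The kind of probe described above can be sketched as follows. This is an illustration of the MSG_PEEK | MSG_DONTWAIT technique, not the actual body of _cleanup_dead_connections(); peer_has_closed is a hypothetical name, and the sketch assumes a plain blocking socket with no Python-level timeout set.

```python
import socket


def peer_has_closed(sock):
    """Return True if the peer closed the connection and nothing is left to read.

    Illustrative probe only; assumes no Python-level timeout on the socket.
    """
    try:
        # MSG_PEEK leaves any data in the buffer; MSG_DONTWAIT avoids blocking.
        data = sock.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except BlockingIOError:
        return False  # no data and no FIN yet: connection still open
    except OSError:
        return True  # reset or otherwise broken socket
    return data == b""  # an empty read means the peer sent FIN
```

A probe like this only helps when something gets a chance to call it, which is the limitation described above: once the process is blocked inside epoll_wait, nothing runs the check.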

Proposed Fix

Enable TCP keepalives on the httpx transport so the kernel sends keepalive probes on idle connections. When the remote is gone, the probes fail and the kernel marks the connection as dead, waking epoll_wait with an error:

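import socket

import httpx
from openai import OpenAI

# TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are Linux socket options;
# socket_options is only accepted by newer httpx releases.
transport = httpx.HTTPTransport(
    socket_options=[
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable keepalive probes
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30),   # start probes after 30s idle
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),  # probe every 10s
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # 3 failed probes = dead
    ]
)
# "..." stands for the existing client arguments, left unchanged.
client = OpenAI(..., http_client=httpx.Client(transport=transport))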

This gives a ~60s worst-case detection window (30 + 3×10) instead of an infinite hang. The existing error_classifier.py disconnect patterns would then fire correctly.
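To make the window arithmetic easy to adjust, the option list and the worst-case estimate can be derived from the same three numbers. The helper names below (keepalive_options, worst_case_detection_s) are hypothetical and only illustrate the calculation; constants missing on a given platform are skipped rather than guessed.

```python
import socket


def keepalive_options(idle_s=30, interval_s=10, probes=3):
    """Build a socket_options-style list of (level, option, value) tuples."""
    opts = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the peer is dead
    )
    for name, value in tuning:
        const = getattr(socket, name, None)  # not exposed on every platform
        if const is not None:
            opts.append((socket.IPPROTO_TCP, const, value))
    return opts


def worst_case_detection_s(idle_s=30, interval_s=10, probes=3):
    """Idle period plus every probe interval: 30 + 3 * 10 = 60 by default."""
    return idle_s + interval_s * probes
```

With the defaults this reproduces the 60s figure; lowering idle_s shortens detection at the cost of more probe traffic on idle connections.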

Environment

  • OS: Linux (Ubuntu)
  • Provider: custom OpenAI-compatible (LiteLLM proxy on 127.0.0.1:4000)
  • api_mode: chat_completions
  • Hermes commit: 10494b42
  • httpx version: see pip show httpx
  • Python: 3.x

Workaround

Kill hung processes manually:

kill -9 <pid>

The existing _cleanup_dead_connections() pre-turn check prevents the next message from hanging on a stale socket, but cannot rescue a process already blocked in epoll_wait.
