Agent hangs indefinitely on CLOSE-WAIT when custom provider drops connection mid-response #10324

## Bug Description

Hermes agent processes hang indefinitely when a custom OpenAI-compatible provider (e.g. a LiteLLM proxy) drops a connection mid-response. The client socket is left in CLOSE-WAIT state and the process never recovers, requiring `kill -9`.

## Steps to Reproduce

1. Configure Hermes with a custom provider pointing to a local LiteLLM proxy (`http://127.0.0.1:4000/v1`)
2. Start a conversation that triggers a streaming API call
3. Restart or interrupt the LiteLLM proxy while the request is in-flight
4. Observe that the Hermes process hangs indefinitely

## Expected Behavior

Hermes should detect the dead connection and either:
- Raise a timeout/disconnect error and surface it to the user, or
- Automatically reconnect and retry

## Actual Behavior

The process hangs forever in `epoll_wait()` (visible as `ep_poll` in the kernel wait channel, e.g. `ps -o pid,wchan,cmd`; `strace` shows the same wait as a pending `epoll_wait` call). The socket enters TCP CLOSE-WAIT: the remote closed its write side, but the httpx async event loop never wakes up because `epoll_wait` does not fire on a socket that received a FIN while no data was buffered.

`ps` output during hang:
```
25000 pts/3    00:03:42 hermes   # stuck in ep_poll
65531 pts/0    00:00:36 hermes   # stuck in ep_poll
```

`ss -tnp` shows both connections in `CLOSE-WAIT` to port 4000.
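The same check can be scripted when `ss` is not available. The following is a minimal sketch, assuming Linux and IPv4 (it parses the text of `/proc/net/tcp`, where state `08` is CLOSE-WAIT); the function name `close_wait_sockets` is hypothetical and not part of Hermes.

```python
CLOSE_WAIT = "08"  # hex code for TCP_CLOSE_WAIT in /proc/net/tcp


def close_wait_sockets(proc_net_tcp: str, remote_port: int) -> list:
    """Return local ports of CLOSE-WAIT sockets connected to remote_port.

    proc_net_tcp is the full text of /proc/net/tcp (IPv4 only; IPv6
    connections are listed separately in /proc/net/tcp6).
    """
    local_ports = []
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        # Addresses look like "0100007F:0FA0" (hex IP, then hex port).
        if state == CLOSE_WAIT and int(remote.rsplit(":", 1)[1], 16) == remote_port:
            local_ports.append(int(local.rsplit(":", 1)[1], 16))
    return local_ports
```

Typical use would be `close_wait_sockets(open("/proc/net/tcp").read(), 4000)`; a non-empty result while Hermes is unresponsive matches the hang described here.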

Only `kill -9` recovers them. Any in-progress session is lost.

## Root Cause Analysis

`run_agent.py` → `_create_openai_client()` creates an `OpenAI(...)` client with:

```python
# agent/anthropic_adapter.py line 257
"timeout": Timeout(timeout=900.0, connect=10.0)
```

The 900s `timeout` is an **httpx read/write inactivity timeout** — it only fires when the kernel reports I/O readiness. A CLOSE-WAIT socket that received FIN but has no buffered data does **not** become readable via epoll until a read is attempted. Since httpx uses async I/O via `epoll_wait`, the timeout timer never starts, and the process blocks forever.

`_cleanup_dead_connections()` (line 4245) correctly probes sockets with `MSG_PEEK | MSG_DONTWAIT` and would detect this — but it only runs at the **start** of `run_conversation()` (line 7817), not during an in-flight API call. Once blocked inside `epoll_wait`, no cleanup code can intervene.
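For reference, a `MSG_PEEK | MSG_DONTWAIT` probe of the kind described above can be sketched as follows. This is not the actual `_cleanup_dead_connections()` code; `peer_has_closed` is a hypothetical helper, and `MSG_DONTWAIT` assumes a Unix-like platform.

```python
import socket


def peer_has_closed(sock: socket.socket) -> bool:
    """Return True if the peer has closed or reset the connection.

    Peeks at one byte without blocking and without consuming it, so any
    buffered response data stays available to the real reader.
    """
    try:
        data = sock.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except (BlockingIOError, InterruptedError):
        return False  # nothing to read yet: connection is still open
    except OSError:
        return True  # reset or otherwise broken connection
    return data == b""  # empty read means the peer sent FIN (CLOSE-WAIT)
```

Because the probe is non-blocking it is cheap enough to run between turns, but it only helps if something calls it; a thread already blocked in `epoll_wait` never reaches it.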

## Proposed Fix

Enable TCP keepalives on the httpx transport so the kernel sends keepalive probes on idle connections. When the remote is gone, the probes fail and the kernel closes the socket, waking `epoll_wait` with an error:

```python
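import socket

import httpx
from openai import OpenAI

# TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are the Linux option names;
# socket_options requires an httpx release whose HTTPTransport accepts it.
transport = httpx.HTTPTransport(
    socket_options=[
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable keepalive probes
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30),   # start probes after 30s idle
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),  # probe every 10s
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # 3 failed probes = dead
    ]
)
# "..." stands for the existing client arguments, left unchanged.
client = OpenAI(..., http_client=httpx.Client(transport=transport))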
```

This gives a ~60s worst-case detection window (30 + 3×10) instead of infinite hang. The existing `error_classifier.py` disconnect patterns would then fire correctly.
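The option list and the detection-window arithmetic can be kept together in one place. The sketch below is an assumption about how that might look, not existing Hermes code: `keepalive_options` and `worst_case_detection_seconds` are hypothetical names, and the `hasattr` guard is there because the three `TCP_KEEP*` constants are not defined on every platform.

```python
import socket


def keepalive_options(idle: int = 30, interval: int = 10, count: int = 3) -> list:
    """Build a socket_options list that enables TCP keepalive.

    Tuning options the platform does not define are skipped, so only
    SO_KEEPALIVE itself is guaranteed to be present.
    """
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the socket is dead
    )
    for name, value in tuning:
        if hasattr(socket, name):
            options.append((socket.IPPROTO_TCP, getattr(socket, name), value))
    return options


def worst_case_detection_seconds(idle: int = 30, interval: int = 10, count: int = 3) -> int:
    """Idle period plus every probe going unanswered."""
    return idle + interval * count
```

With the defaults, `worst_case_detection_seconds()` gives the 60s figure above, and `keepalive_options()` can be passed as `socket_options` to the transport.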

## Environment

- OS: Linux (Ubuntu)
- Provider: custom OpenAI-compatible (LiteLLM proxy on `127.0.0.1:4000`)
- `api_mode`: `chat_completions`
- Hermes commit: `10494b42`
- httpx version: see `pip show httpx`
- Python: 3.x

## Workaround

Kill hung processes manually:
```bash
kill -9 <pid>
```

The existing `_cleanup_dead_connections()` pre-turn check prevents the *next* message from hanging on a stale socket, but cannot rescue a process already blocked in `epoll_wait`.

