Summary
When the main provider experiences a transient ReadTimeout, the main turn reconnects cleanly (⚠️ Connection to provider dropped (ReadTimeout). Reconnecting… 🔄 Reconnected — resuming…), but a concurrent agent.auxiliary_client call (e.g. Vision auto-detect or title generation) never completes and never times out. The CLI timer keeps counting, the input area remains locked, and the only recovery is kill -TERM on the agent process. No error is written to logs — the last line is the auxiliary client's "using main provider" INFO log, followed by complete silence.
This appears distinct from #13617 (approval prompt — closed), #8340 (terminal tool setsid+disown hang — closed), and #13033 (Linux paste freeze — open).
Environment
- OS: macOS (Apple Silicon)
- Hermes: installed via pipx / user-local (Python 3.11)
- Provider: Anthropic (
claude-opus-4-6)
- MCP servers loaded:
github, atlassian, one internal stdio MCP server, plus code-execution
- Session in progress for ~38 min, ~328K/1M context used at time of hang
Reproduction (observed in the wild, not yet minimized)
- Start a long-running session with multiple MCP tool calls interspersed with model turns.
- At some point the main provider emits a
ReadTimeout; the UI shows Reconnecting… (attempt 2/3) then Reconnected — resuming….
- After the reconnect, the next auxiliary call fires — in my case
Vision auto-detect (also reproduces with title_generation per auxiliary_client logs).
agent.log shows the "using main provider" INFO line for the auxiliary call, then nothing for 25+ minutes.
- The CLI
⏲ timer continues to count, the ❯ prompt is rendered but accepts no keyboard input (Ctrl+C, Ctrl+D, Enter all ignored).
ps shows the agent process alive at ~0% CPU in S state — blocked on an await.
Last log evidence
~/.hermes/logs/agent.log (final lines before 25+ min of silence):
2026-XX-XX 19:32:31,460 INFO agent.auxiliary_client: Auxiliary auto-detect: using main provider anthropic (claude-opus-4-6)
2026-XX-XX 19:32:31,460 INFO agent.auxiliary_client: Auxiliary title_generation: using auto (claude-opus-4-6) at https://api.anthropic.com
2026-XX-XX 19:39:41,513 INFO agent.auxiliary_client: Vision auto-detect: using main provider anthropic (claude-opus-4-6)
<nothing for 25+ minutes>
No entry in errors.log at the hang timestamp. No broken-pipe or exception trace tied to the hang (there was an unrelated BrokenPipeError earlier in mcp-stderr.log, but it preceded the hang by hours and shouldn't be the cause).
Suspected root cause
agent/auxiliary_client.py (or equivalent) appears to issue the HTTP request to the auxiliary provider without a timeout= kwarg on the httpx / anthropic SDK call, and without wrapping the call in asyncio.wait_for(...) or a circuit breaker. When the underlying connection stalls (possibly because the socket pool is in a half-dead state post-reconnect of the main client), the coroutine awaits forever. There is no retry-with-timeout path on this branch, unlike the main turn's reconnect logic.
Related candidate: the auxiliary client may be reusing the HTTPX client / pool that the main provider just reconnected, inheriting a stale connection that then silently hangs on the first auxiliary request.
Expected behavior
- All auxiliary calls (
Vision auto-detect, title_generation, etc.) should have an explicit request timeout (e.g. 30s) applied via the SDK timeout= parameter or asyncio.wait_for.
- On timeout, the auxiliary call should:
- log
WARNING with the auxiliary kind (vision / title / etc.) and the elapsed time,
- fail open (e.g. skip vision hint, use default title) rather than blocking the main turn,
- not propagate into the CLI input lock.
- Main turn should never be gated on auxiliary completion. If the current architecture makes them coupled, add a deadline to the wrapping task group.
Workaround (for other users hitting this)
None at runtime — once hung, the CLI cannot be recovered without kill -TERM <pid> on the agent process. Before restarting, back up ~/.hermes/sessions/session_<id>.json and, if using --resume, archive / remove the stuck session file first to avoid a resume loop (see #7536).
Additional notes
- Found a related-but-separate issue worth fixing:
tools.checkpoint_manager fails repeatedly on git add -A with index.lock: File exists after a prior kill, and never self-heals. Resulted in ~6 hours of silently lost checkpoints on my end. Happy to open a separate issue for that.
- Would be useful to emit a
WARNING when an auxiliary call exceeds some threshold even if not cancelled, so users can diagnose stalls without reading source.
What I can provide on request
- Redacted
agent.log window around the hang
- Process state snapshot (
ps, py-spy dump if needed next time I reproduce)
- Session JSON structure (without conversation content)
Summary
When the main provider experiences a transient
ReadTimeout, the main turn reconnects cleanly (⚠️ Connection to provider dropped (ReadTimeout). Reconnecting… 🔄 Reconnected — resuming…), but a concurrentagent.auxiliary_clientcall (e.g. Vision auto-detect or title generation) never completes and never times out. The CLI timer keeps counting, the input area remains locked, and the only recovery iskill -TERMon the agent process. No error is written to logs — the last line is the auxiliary client's "using main provider" INFO log, followed by complete silence.This appears distinct from #13617 (approval prompt — closed), #8340 (terminal tool setsid+disown hang — closed), and #13033 (Linux paste freeze — open).
Environment
claude-opus-4-6)github,atlassian, one internal stdio MCP server, plus code-executionReproduction (observed in the wild, not yet minimized)
ReadTimeout; the UI showsReconnecting… (attempt 2/3)thenReconnected — resuming….Vision auto-detect(also reproduces withtitle_generationperauxiliary_clientlogs).agent.logshows the "using main provider" INFO line for the auxiliary call, then nothing for 25+ minutes.⏲timer continues to count, the❯prompt is rendered but accepts no keyboard input (Ctrl+C, Ctrl+D, Enter all ignored).psshows the agent process alive at ~0% CPU inSstate — blocked on an await.Last log evidence
~/.hermes/logs/agent.log(final lines before 25+ min of silence):No entry in
errors.logat the hang timestamp. No broken-pipe or exception trace tied to the hang (there was an unrelatedBrokenPipeErrorearlier inmcp-stderr.log, but it preceded the hang by hours and shouldn't be the cause).Suspected root cause
agent/auxiliary_client.py(or equivalent) appears to issue the HTTP request to the auxiliary provider without atimeout=kwarg on the httpx / anthropic SDK call, and without wrapping the call inasyncio.wait_for(...)or a circuit breaker. When the underlying connection stalls (possibly because the socket pool is in a half-dead state post-reconnect of the main client), the coroutine awaits forever. There is no retry-with-timeout path on this branch, unlike the main turn'sreconnectlogic.Related candidate: the auxiliary client may be reusing the HTTPX client / pool that the main provider just reconnected, inheriting a stale connection that then silently hangs on the first auxiliary request.
Expected behavior
Vision auto-detect,title_generation, etc.) should have an explicit request timeout (e.g. 30s) applied via the SDKtimeout=parameter orasyncio.wait_for.WARNINGwith the auxiliary kind (vision / title / etc.) and the elapsed time,Workaround (for other users hitting this)
None at runtime — once hung, the CLI cannot be recovered without
kill -TERM <pid>on the agent process. Before restarting, back up~/.hermes/sessions/session_<id>.jsonand, if using--resume, archive / remove the stuck session file first to avoid a resume loop (see #7536).Additional notes
tools.checkpoint_managerfails repeatedly ongit add -Awithindex.lock: File existsafter a prior kill, and never self-heals. Resulted in ~6 hours of silently lost checkpoints on my end. Happy to open a separate issue for that.WARNINGwhen an auxiliary call exceeds some threshold even if not cancelled, so users can diagnose stalls without reading source.What I can provide on request
agent.logwindow around the hangps,py-spy dumpif needed next time I reproduce)