Skip to content

[Bug]: auxiliary_client (Vision auto-detect / title generation) hangs indefinitely with no timeout when provider stalls; CLI input blocks #15194

@ideep40

Description

@ideep40

Summary

When the main provider experiences a transient ReadTimeout, the main turn reconnects cleanly (⚠️ Connection to provider dropped (ReadTimeout). Reconnecting… 🔄 Reconnected — resuming…), but a concurrent agent.auxiliary_client call (e.g. Vision auto-detect or title generation) never completes and never times out. The CLI timer keeps counting, the input area remains locked, and the only recovery is kill -TERM on the agent process. No error is written to logs — the last line is the auxiliary client's "using main provider" INFO log, followed by complete silence.

This appears distinct from #13617 (approval prompt — closed), #8340 (terminal tool setsid+disown hang — closed), and #13033 (Linux paste freeze — open).

Environment

  • OS: macOS (Apple Silicon)
  • Hermes: installed via pipx / user-local (Python 3.11)
  • Provider: Anthropic (claude-opus-4-6)
  • MCP servers loaded: github, atlassian, one internal stdio MCP server, plus code-execution
  • Session in progress for ~38 min, ~328K/1M context used at time of hang

Reproduction (observed in the wild, not yet minimized)

  1. Start a long-running session with multiple MCP tool calls interspersed with model turns.
  2. At some point the main provider emits a ReadTimeout; the UI shows Reconnecting… (attempt 2/3) then Reconnected — resuming….
  3. After the reconnect, the next auxiliary call fires — in my case Vision auto-detect (also reproduces with title_generation per auxiliary_client logs).
  4. agent.log shows the "using main provider" INFO line for the auxiliary call, then nothing for 25+ minutes.
  5. The CLI timer continues to count, the prompt is rendered but accepts no keyboard input (Ctrl+C, Ctrl+D, Enter all ignored).
  6. ps shows the agent process alive at ~0% CPU in S state — blocked on an await.

Last log evidence

~/.hermes/logs/agent.log (final lines before 25+ min of silence):

2026-XX-XX 19:32:31,460 INFO agent.auxiliary_client: Auxiliary auto-detect: using main provider anthropic (claude-opus-4-6)
2026-XX-XX 19:32:31,460 INFO agent.auxiliary_client: Auxiliary title_generation: using auto (claude-opus-4-6) at https://api.anthropic.com
2026-XX-XX 19:39:41,513 INFO agent.auxiliary_client: Vision auto-detect: using main provider anthropic (claude-opus-4-6)
<nothing for 25+ minutes>

No entry in errors.log at the hang timestamp. No broken-pipe or exception trace tied to the hang (there was an unrelated BrokenPipeError earlier in mcp-stderr.log, but it preceded the hang by hours and shouldn't be the cause).

Suspected root cause

agent/auxiliary_client.py (or equivalent) appears to issue the HTTP request to the auxiliary provider without a timeout= kwarg on the httpx / anthropic SDK call, and without wrapping the call in asyncio.wait_for(...) or a circuit breaker. When the underlying connection stalls (possibly because the socket pool is in a half-dead state post-reconnect of the main client), the coroutine awaits forever. There is no retry-with-timeout path on this branch, unlike the main turn's reconnect logic.

Related candidate: the auxiliary client may be reusing the HTTPX client / pool that the main provider just reconnected, inheriting a stale connection that then silently hangs on the first auxiliary request.

Expected behavior

  1. All auxiliary calls (Vision auto-detect, title_generation, etc.) should have an explicit request timeout (e.g. 30s) applied via the SDK timeout= parameter or asyncio.wait_for.
  2. On timeout, the auxiliary call should:
    • log WARNING with the auxiliary kind (vision / title / etc.) and the elapsed time,
    • fail open (e.g. skip vision hint, use default title) rather than blocking the main turn,
    • not propagate into the CLI input lock.
  3. Main turn should never be gated on auxiliary completion. If the current architecture makes them coupled, add a deadline to the wrapping task group.

Workaround (for other users hitting this)

None at runtime — once hung, the CLI cannot be recovered without kill -TERM <pid> on the agent process. Before restarting, back up ~/.hermes/sessions/session_<id>.json and, if using --resume, archive / remove the stuck session file first to avoid a resume loop (see #7536).

Additional notes

  • Found a related-but-separate issue worth fixing: tools.checkpoint_manager fails repeatedly on git add -A with index.lock: File exists after a prior kill, and never self-heals. Resulted in ~6 hours of silently lost checkpoints on my end. Happy to open a separate issue for that.
  • Would be useful to emit a WARNING when an auxiliary call exceeds some threshold even if not cancelled, so users can diagnose stalls without reading source.

What I can provide on request

  • Redacted agent.log window around the hang
  • Process state snapshot (ps, py-spy dump if needed next time I reproduce)
  • Session JSON structure (without conversation content)

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High — major feature broken, no workaroundcomp/agentCore agent loop, run_agent.py, prompt buildercomp/cliCLI entry point, hermes_cli/, setup wizardtype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions