Skip to content

fix(streaming): handle Anthropic client in stale stream detector#14430

Open
xssssrf wants to merge 1 commit into
NousResearch:mainfrom
xssssrf:fix/streaming-stale-detector-anthropic
Open

fix(streaming): handle Anthropic client in stale stream detector#14430
xssssrf wants to merge 1 commit into
NousResearch:mainfrom
xssssrf:fix/streaming-stale-detector-anthropic

Conversation

@xssssrf

@xssssrf xssssrf commented Apr 23, 2026

Copy link
Copy Markdown

Summary

The streaming stale detector in _run_streaming_api_call() does not handle the anthropic_messages api_mode. When the stale timeout fires, the code attempts to close request_client_holder["client"] — but for Anthropic, this holder is always None (the Anthropic SDK manages its own transport internally). As a result, the stale connection is never actually closed and no fresh connection is ever established.

Root Cause

Three connection-reset sites exist in the streaming path:

Location Purpose Has Anthropic branch?
~line 5235 Non-streaming stale detector
~line 6179 Streaming stale detector Missing
~line 6199 Interrupt handler

The streaming stale detector was copied from the OpenAI path without being extended to cover Anthropic.

Symptom

When the agent is running on the anthropic_messages api_mode and a stream stalls (e.g. due to an SSE keep-alive with no real chunks), the stale detector fires and emits "⚠️ No response from provider... Reconnecting..." — but the underlying Anthropic client is never closed. The inner _call_anthropic() thread keeps waiting on the same dead stream. Each subsequent detector tick emits another status message while doing nothing, leaving the agent stuck until the stream eventually errors out on its own.

Fix

Mirror the pattern already used by the non-streaming stale detector (~line 5235) and the interrupt handler (~line 6199): branch on self.api_mode and call self._anthropic_client.close() + rebuild via build_anthropic_client() for the Anthropic path, keeping the existing OpenAI path unchanged. Guard _replace_primary_openai_client() to the non-Anthropic path only, since it operates on the OpenAI connection pool.

Testing

  • python -m py_compile run_agent.py passes
  • pytest tests/agent/ -q — 1735 passed, 1 skipped. The 1 pre-existing failure (test_minimax_provider.py::TestMinimaxSwitchModelCredentialGuard) reproduces identically on the unmodified upstream run_agent.py, confirming it is unrelated to this change.
  • Tested on macOS (Apple Silicon) with anthropic_messages api_mode via direct Anthropic API key
  • Existing OpenAI/chat-completions path behavior is unchanged

The streaming stale detector in _run_streaming_api_call() only handled
the OpenAI/chat-completions path when killing a stale connection. When
api_mode is 'anthropic_messages', request_client_holder['client'] is
always None (the Anthropic SDK uses its own internal transport), so the
existing code was a no-op: the stale Anthropic stream was never actually
closed, and no fresh connection was ever established.

This meant that after the stale detector fired, the inner _call_anthropic()
thread would keep waiting on the same dead stream indefinitely. Each
subsequent stale detector tick would emit another 'Reconnecting...' status
message but do nothing to reset the connection, causing the agent to appear
stuck for the full duration of the stale stream's lifetime.

The fix mirrors the existing pattern already used in two other locations
in the same file (the non-streaming stale detector at ~line 5235 and the
interrupt handler at ~line 6199): branch on api_mode and call
self._anthropic_client.close() + rebuild via build_anthropic_client()
for the Anthropic path, while keeping the existing OpenAI path unchanged.

The _replace_primary_openai_client() call is also guarded to only run
on the non-Anthropic path, since it operates on the OpenAI client pool
and is irrelevant when using the Anthropic SDK.
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder provider/anthropic Anthropic native Messages API labels Apr 23, 2026
@im-sham

im-sham commented Apr 26, 2026

Copy link
Copy Markdown

Thanks for tracking this down. I hit the same failure mode on a Linux gateway using anthropic_messages: Hermes repeatedly emitted No response from provider ... Reconnecting..., but the Anthropic stream was not actually closed/rebuilt, so the session stayed stuck until a harder restart/recovery path.

The root cause here matches what I saw. One suggestion before merge: it would be worth adding regression coverage for the anthropic_messages stale-stream path, since prior fixes covered adjacent reconnect sites but this branch was still able to regress. I tested a local variant with a fake blocking Anthropic stream that asserts the stale detector closes the old Anthropic client and rebuilds it.

Validation from that variant:

  • python3 -m py_compile run_agent.py tests/run_agent/test_streaming.py
  • pytest -o addopts= tests/run_agent/test_streaming.py -q -> 31 passed
  • adjacent Anthropic credential/OAuth slice -> 60 passed

Also minor implementation note: current origin/main has _rebuild_anthropic_client(), so the stale detector can call self._anthropic_client.close() followed by self._rebuild_anthropic_client() rather than importing/reconstructing the client inline. Happy to open a tiny follow-up PR with just the regression test/helper-based version if that would help.

@dbreslavets

Copy link
Copy Markdown

Same bug in prod with provider: claude-code (OAuth → anthropic_messages mode) — six 180s stalls in one day, same OPENAI_API_KEY warning.

One nit before merge: re-implementing the Anthropic rebuild inline skips two things _rebuild_anthropic_client already handles — _oauth_1m_beta_disabled carry-forward (matters for OAuth) and the Bedrock branch:

try:
    if self.api_mode == "anthropic_messages":
        try:
            self._anthropic_client.close()
        except Exception:
            pass
        self._rebuild_anthropic_client()
    else:
        rc = request_client_holder.get("client")
        if rc is not None:
            self._close_request_openai_client(rc, reason="stale_stream_kill")
except Exception:
    pass

I have a 4-case dispatch test (addresses @im-sham's coverage note) — can attach if useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround provider/anthropic Anthropic native Messages API type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants