fix(stream): rebuild Anthropic client on stream retry, not the OpenAI client #44076

Open
AIalliAI wants to merge 1 commit into NousResearch:main from AIalliAI:fix/stream-retry-anthropic-pool

@AIalliAI

Contributor

Fixes #44006

Problem

When a provider stream drops mid-generation, the retry path in interruptible_streaming_api_call unconditionally calls _replace_primary_openai_client(reason="stream_retry_pool_cleanup") to purge dead connections before retrying.

In anthropic_messages mode this is wrong in two ways:

  1. The rebuild always fails. Hermes routes MiniMax (and other third-party Anthropic-protocol providers) through the Anthropic SDK; in this mode agent.client is None and agent._client_kwargs == {} by design. The rebuild constructs OpenAI(**{}), which raises exactly the error in the issue report: The api_key client option must be set .... This is why the failure log says the rebuilt client "has no api_key" — there was never an OpenAI client (or kwargs) to rebuild in this mode.

  2. The client that actually carried the stream is never cleaned. The dropped connection lives in agent._anthropic_client's httpx pool, where the retry can pick it right back up and time out again — losing the generation. The per-attempt _try_refresh_anthropic_client_credentials call doesn't cover this either: it returns early for non-native providers (provider != "anthropic") and only rebuilds when the token actually rotated.
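The first failure mode can be illustrated without the real SDKs. This is a minimal sketch, assuming a stand-in `FakeOpenAI` class (hypothetical, not the OpenAI SDK) whose constructor rejects a missing `api_key` with the message quoted in the issue; `rebuild_openai_client` is likewise an illustrative name, not Hermes code, and the stand-in raises `ValueError` purely for simplicity.

```python
class FakeOpenAI:
    """Hypothetical stand-in for an SDK client that requires an api_key."""

    def __init__(self, api_key=None, base_url=None):
        if not api_key:
            raise ValueError("The api_key client option must be set")
        self.api_key = api_key
        self.base_url = base_url


def rebuild_openai_client(client_kwargs):
    """Try to rebuild from stored kwargs; return (client, error_message)."""
    try:
        return FakeOpenAI(**client_kwargs), None
    except ValueError as exc:
        return None, str(exc)


# In anthropic_messages mode the stored OpenAI kwargs are empty by design,
# so the rebuild has nothing to work with and fails on every retry.
client, error = rebuild_openai_client({})
```

With empty kwargs the rebuild can never succeed, which is why retrying it does not help: the credentials live with the Anthropic client instead.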

Fix

  • Add AIAgent._replace_primary_anthropic_client(reason=...) — the Anthropic-protocol analog of _replace_primary_openai_client: rebuilds the shared Anthropic client with the same credentials/base_url/timeout, swapping in a fresh connection pool, and closes the old client. On rebuild failure the old client is kept and False is returned (mirrors the OpenAI variant).
  • Make both stream-retry cleanup sites (stream_retry_pool_cleanup and stream_mid_tool_retry_pool_cleanup) mode-aware: rebuild the Anthropic client in anthropic_messages mode, the shared OpenAI client otherwise.

OpenAI-wire (chat_completions) behavior is unchanged.
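The rebuild contract described above (same credentials, fresh pool, old client kept on failure) might look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: `AgentSketch`, the injected `build_client` factory, and the `credentials` dict are hypothetical names used so the example runs on its own.

```python
import logging

logger = logging.getLogger(__name__)


class AgentSketch:
    """Hypothetical, trimmed-down agent; not the real AIAgent."""

    def __init__(self, api_mode, anthropic_client, build_client, credentials):
        self.api_mode = api_mode
        self._anthropic_client = anthropic_client
        self._build_client = build_client  # assumed factory callable
        self._credentials = credentials    # assumed: api_key/base_url/timeout

    def replace_primary_anthropic_client(self, reason):
        """Swap in a fresh client (new pool); keep the old one on failure."""
        if self.api_mode != "anthropic_messages" or self._anthropic_client is None:
            return False  # nothing to rebuild in this mode
        old_client = self._anthropic_client
        try:
            new_client = self._build_client(**self._credentials)
        except Exception as exc:
            logger.warning("Anthropic client rebuild failed (%s): %s", reason, exc)
            return False  # old client stays in place
        self._anthropic_client = new_client
        try:
            old_client.close()  # release the old connection pool
        except Exception as exc:
            logger.debug("Closing old Anthropic client failed: %s", exc)
        return True
```

Building the new client before touching the old one means a failed rebuild leaves the agent no worse off than before the retry.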

Tests

  • New TestReplacePrimaryAnthropicClient covering: rebuild uses the same credentials and closes the old client; failed rebuild keeps the old client; no-op outside anthropic_messages mode; no-op when no Anthropic client exists.
  • Updated test_anthropic_stream_parser_valueerror_retries_before_delivery (which reproduces the issue's exact setup — MiniMax via api.minimax.io/anthropic) to assert the retry now purges the Anthropic pool and never attempts the doomed OpenAI rebuild.
  • tests/run_agent/: 1624 passed, 3 skipped. The two test_provider_parity.py::TestAuxiliaryClientProviderPriority failures in the full-suite run are pre-existing on main (order-dependent; both pass in isolation on main and on this branch).
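For readers unfamiliar with the test style, two of the cases listed above can be expressed with `unittest.mock` along these lines. This is an illustrative sketch only: `swap_client` is a hypothetical helper standing in for the method under test, and the checks are not the actual tests in `TestReplacePrimaryAnthropicClient`.

```python
from unittest.mock import MagicMock


def swap_client(holder, build, credentials):
    """Hypothetical helper under test: rebuild, then close the old client."""
    old = holder["client"]
    try:
        holder["client"] = build(**credentials)
    except Exception:
        return False  # keep the old client
    old.close()
    return True


def check_rebuild_uses_same_credentials():
    old, new = MagicMock(), MagicMock()
    build = MagicMock(return_value=new)
    holder = {"client": old}
    creds = {"api_key": "k", "base_url": "https://example.invalid", "timeout": 30}
    assert swap_client(holder, build, creds) is True
    build.assert_called_once_with(**creds)  # same credentials passed through
    old.close.assert_called_once()          # old pool released
    assert holder["client"] is new


def check_failed_rebuild_keeps_old_client():
    old = MagicMock()
    build = MagicMock(side_effect=RuntimeError("boom"))
    holder = {"client": old}
    assert swap_client(holder, build, {}) is False
    old.close.assert_not_called()           # old client untouched
    assert holder["client"] is old
```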

🤖 Generated with Claude Code

The stream-retry handler in interruptible_streaming_api_call
unconditionally called _replace_primary_openai_client to purge dead
pooled connections before retrying. In anthropic_messages mode
(MiniMax, Alibaba, and native Anthropic all stream over the Anthropic
SDK) that rebuild is a guaranteed failure: agent.client is None and
_client_kwargs is {} by design, so the OpenAI constructor raises
"The api_key client option must be set" on every retry.

Worse, the client that actually carried the dropped stream —
_anthropic_client — kept its poisoned httpx pool, so the retry could
pick the dead connection right back up and time out again, losing the
generation. _try_refresh_anthropic_client_credentials doesn't cover
this: it returns early for non-native providers and only rebuilds when
the token rotated.

Add _replace_primary_anthropic_client (same credentials, fresh pool)
and make both retry-cleanup sites mode-aware.

Fixes NousResearch#44006

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@liuhao1024

Contributor

Verification: reviewed diff — clean fix.

The new _replace_primary_anthropic_client() method mirrors the existing _replace_primary_openai_client() pattern correctly:

  • Guards on api_mode == "anthropic_messages" and non-None _anthropic_client
  • Rebuilds via build_anthropic_client() with the same credentials (key + base_url + timeout)
  • Closes old client after successful swap; keeps old client on build failure
  • All exceptions caught and logged

The two call sites in chat_completion_helpers.py (mid-tool retry at L2338 and stream retry at L2396) now branch on api_mode to rebuild the correct client. Previously both paths unconditionally called _replace_primary_openai_client(); in anthropic_messages mode _client_kwargs is empty, so that rebuild always fails and the dead connection stays in the Anthropic httpx pool for the retry to hit again.
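The branching at those call sites can be sketched as a small dispatch helper. The two `_replace_primary_*` method names come from the PR; `cleanup_pool_for_retry` and `StubAgent` are hypothetical names introduced here so the example is self-contained, and the real call sites may be structured differently.

```python
class StubAgent:
    """Hypothetical stand-in that records which rebuild was requested."""

    def __init__(self, api_mode):
        self.api_mode = api_mode
        self.calls = []

    def _replace_primary_anthropic_client(self, reason):
        self.calls.append(("anthropic", reason))
        return True

    def _replace_primary_openai_client(self, reason):
        self.calls.append(("openai", reason))
        return True


def cleanup_pool_for_retry(agent, reason):
    """Rebuild whichever client carried the stream for the current api_mode."""
    if agent.api_mode == "anthropic_messages":
        return agent._replace_primary_anthropic_client(reason=reason)
    return agent._replace_primary_openai_client(reason=reason)
```

Keying the choice on `api_mode` keeps the OpenAI-wire path exactly as it was while routing Anthropic-protocol retries to the client that owns the dropped connection.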

Test updates are well-aligned:

  • test_anthropic_stream_parser_valueerror_retries_before_delivery now asserts the Anthropic mock is called (not the OpenAI mock)
  • New TestReplacePrimaryAnthropicClient class covers: rebuild with same credentials, keeps old client on build failure, no-op outside anthropic mode, no-op without anthropic client

@alt-glitch added labels on Jun 11, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), provider/anthropic (Anthropic native Messages API)
@AIalliAI

Contributor Author

Requesting maintainer review — this is ready to land from my side. Standalone fork CI is pending first-run approval here; the rollup branch in #44061 carrying this session's batch is fully green on upstream CI (all test shards, typecheck, e2e).

Development

Successfully merging this pull request may close these issues.

Stream-retry rebuilds OpenAI client without api_key, drops the retry

3 participants