fix(stream): rebuild Anthropic client on stream retry, not the OpenAI client #44076
AIalliAI wants to merge 1 commit into
The stream-retry handler in interruptible_streaming_api_call
unconditionally called _replace_primary_openai_client to purge dead
pooled connections before retrying. In anthropic_messages mode
(MiniMax, Alibaba, and native Anthropic all stream over the Anthropic
SDK) that rebuild is a guaranteed failure: agent.client is None and
_client_kwargs is {} by design, so the OpenAI constructor raises
"The api_key client option must be set" on every retry.
Worse, the client that actually carried the dropped stream —
_anthropic_client — kept its poisoned httpx pool, so the retry could
pick the dead connection right back up and time out again, losing the
generation. _try_refresh_anthropic_client_credentials doesn't cover
this: it returns early for non-native providers and only rebuilds when
the token rotated.
Add _replace_primary_anthropic_client (same credentials, fresh pool)
and make both retry-cleanup sites mode-aware.
Fixes NousResearch#44006
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
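A minimal sketch of the mode-aware cleanup described in this commit message, not the actual patch. The two `_replace_primary_*_client` method names and the mode strings come from this PR; `StubAgent`, its `api_mode` attribute, and `cleanup_pool_before_retry` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StubAgent:
    """Hypothetical stand-in for the agent; records which rebuild was requested."""

    api_mode: str  # assumed attribute name; the PR only names the mode values
    calls: list = field(default_factory=list)

    def _replace_primary_openai_client(self, reason: str) -> bool:
        self.calls.append(("openai", reason))
        return True

    def _replace_primary_anthropic_client(self, reason: str) -> bool:
        self.calls.append(("anthropic", reason))
        return True


def cleanup_pool_before_retry(agent, reason: str) -> bool:
    """Rebuild whichever client actually carried the dropped stream."""
    if agent.api_mode == "anthropic_messages":
        # No OpenAI client exists in this mode, so only the Anthropic
        # client's pool can hold the dead connection.
        return agent._replace_primary_anthropic_client(reason=reason)
    return agent._replace_primary_openai_client(reason=reason)


agent = StubAgent(api_mode="anthropic_messages")
cleanup_pool_before_retry(agent, "stream_retry_pool_cleanup")
print(agent.calls)
```

The point of the branch is that the OpenAI rebuild is never attempted in `anthropic_messages` mode, while `chat_completions` callers keep the existing path.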
Verification: reviewed diff — clean fix. The new
The two call sites in Test updates are well-aligned:
Requesting maintainer review: this is ready to land from my side. Standalone fork CI is pending first-run approval here; the rollup branch in #44061 carrying this session's batch is fully green on upstream CI (all test shards, typecheck, e2e).
Fixes #44006
Problem
When a provider stream drops mid-generation, the retry path in interruptible_streaming_api_call unconditionally calls _replace_primary_openai_client(reason="stream_retry_pool_cleanup") to purge dead connections before retrying.
In anthropic_messages mode this is wrong on both counts:
1. The rebuild always fails. Hermes routes MiniMax (and other third-party Anthropic-protocol providers) through the Anthropic SDK; in this mode agent.client is None and agent._client_kwargs == {} by design. The rebuild constructs OpenAI(**{}), which raises exactly the error in the issue report: "The api_key client option must be set ...". This is why the failure log says the rebuilt client "has no api_key": there was never an OpenAI client (or kwargs) to rebuild in this mode.
2. The client that actually carried the stream is never cleaned. The dropped connection lives in agent._anthropic_client's httpx pool, where the retry can pick it right back up and time out again, losing the generation. The per-attempt _try_refresh_anthropic_client_credentials call doesn't cover this either: it returns early for non-native providers (provider != "anthropic") and only rebuilds when the token actually rotated.
Fix
- AIAgent._replace_primary_anthropic_client(reason=...): the Anthropic-protocol analog of _replace_primary_openai_client. It rebuilds the shared Anthropic client with the same credentials/base_url/timeout, swapping in a fresh connection pool, and closes the old client. On rebuild failure the old client is kept and False is returned (mirrors the OpenAI variant).
- Both retry-cleanup sites (stream_retry_pool_cleanup and stream_mid_tool_retry_pool_cleanup) are now mode-aware: rebuild the Anthropic client in anthropic_messages mode, the shared OpenAI client otherwise.
OpenAI-wire (chat_completions) behavior is unchanged.
Tests
- TestReplacePrimaryAnthropicClient covering: rebuild uses the same credentials and closes the old client; failed rebuild keeps the old client; no-op outside anthropic_messages mode; no-op when no Anthropic client exists.
- test_anthropic_stream_parser_valueerror_retries_before_delivery (which reproduces the issue's exact setup, MiniMax via api.minimax.io/anthropic) now asserts that the retry purges the Anthropic pool and never attempts the doomed OpenAI rebuild.
- tests/run_agent/: 1624 passed, 3 skipped. The two test_provider_parity.py::TestAuxiliaryClientProviderPriority failures in the full-suite run are pre-existing on main (order-dependent; both pass in isolation on main and on this branch).
🤖 Generated with Claude Code
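The rebuild contract listed under Fix (same credentials, fresh pool, old client closed, old client kept on failure) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `FakeClient`, `ClientHolder`, and the injected `factory` are hypothetical, and returning False when no client exists is an assumption (the PR only calls that case a no-op).

```python
import logging

logger = logging.getLogger(__name__)


class FakeClient:
    """Hypothetical stand-in for an SDK client that owns a connection pool."""

    def __init__(self, api_key: str, base_url: str):
        self.api_key = api_key
        self.base_url = base_url
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ClientHolder:
    """Hypothetical owner of the shared client, playing the agent's role."""

    def __init__(self, client, factory):
        self.client = client
        self._factory = factory  # builds a client from (api_key, base_url)

    def replace_client(self, reason: str) -> bool:
        old = self.client
        if old is None:
            return False  # assumption: nothing to rebuild counts as a no-op
        try:
            # Build the replacement first so a failure cannot leave us clientless.
            new = self._factory(old.api_key, old.base_url)
        except Exception:
            logger.warning("client rebuild failed (%s); keeping old client", reason)
            return False
        self.client = new
        try:
            old.close()  # release the old pool, including any dead connection
        except Exception:
            logger.debug("closing old client failed (%s)", reason)
        return True
```

Building the new client before touching the old one is what makes the failure case safe: the retry still has a usable, if possibly stale, client instead of none.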