fix(streaming): rebuild Anthropic client on stream cleanup instead of OpenAI client #28240

Open
EloquentBrush0x wants to merge 1 commit into NousResearch:main from EloquentBrush0x:fix/issue-28161-anthropic-stream-pool-cleanup
Conversation

@EloquentBrush0x

Contributor

What does this PR do?

Root cause: interruptible_streaming_api_call() has three connection-pool cleanup sites that called _replace_primary_openai_client() unconditionally — at the mid-tool retry path (~L1774), the transient-error retry path (~L1833), and the stale-stream outer-poll loop (~L1977).

For api_mode=anthropic_messages this has two consequences:

  1. _replace_primary_openai_client() silently fails (no OPENAI_API_KEY on Anthropic-only configs), so the dead connection pool is never purged before the next retry.
  2. The stale-stream detector's outer-poll site (~L1977) is the only mechanism that can interrupt the worker thread while it blocks in for event in stream:. Because the Anthropic client is never closed and the stream's underlying transport is never dropped, the thread stays blocked until the 900 s httpx read-timeout fires. This is the 15-minute hang reported in gateways running claude-opus-4-7.
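
The second consequence can be illustrated with a minimal sketch. It does not use the Anthropic SDK or httpx: FakeStream is a hypothetical stand-in whose read blocks until either a timeout elapses or close() is called from another thread, which is the behaviour this description attributes to the real transport.

```python
import threading


class FakeStream:
    """Hypothetical stand-in for a streaming response (not the Anthropic SDK)."""

    def __init__(self, read_timeout):
        self._closed = threading.Event()
        self._read_timeout = read_timeout

    def close(self):
        # Simulates dropping the underlying transport from another thread.
        self._closed.set()

    def __iter__(self):
        return self

    def __next__(self):
        # Block like a socket read; wake early only if the transport is closed.
        if self._closed.wait(timeout=self._read_timeout):
            raise ConnectionError("transport closed")
        raise TimeoutError("read timeout")


def worker(stream, result):
    # Mirrors the shape of the blocked loop: `for event in stream:`.
    try:
        for _event in stream:
            pass
    except (ConnectionError, TimeoutError) as exc:
        result.append(type(exc).__name__)


stream = FakeStream(read_timeout=900.0)  # same order as the 900 s read-timeout
result = []
thread = threading.Thread(target=worker, args=(stream, result), daemon=True)
thread.start()

thread.join(timeout=0.2)
still_blocked = thread.is_alive()  # nothing has closed the stream yet

stream.close()  # what the stale-stream detector needs to trigger
thread.join(timeout=2.0)
```

In this sketch the worker only returns once close() is called; without it, the loop would wait for the full read timeout.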

Fix: Mirror the existing interrupt-path pattern at L1989–1997 (already correct) at all three cleanup sites. For api_mode == "anthropic_messages", call _anthropic_client.close() + _rebuild_anthropic_client() instead of _replace_primary_openai_client(). _rebuild_anthropic_client() handles both direct Anthropic and Bedrock-hosted Claude correctly.
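
The branch applied at each cleanup site might look roughly like the sketch below. The helper name cleanup_stream_pool, the reason keyword argument, and the use of a MagicMock in place of the real agent object are assumptions for illustration; only api_mode, _anthropic_client.close(), _rebuild_anthropic_client(), and _replace_primary_openai_client() come from this PR, and the real signatures may differ.

```python
from unittest.mock import MagicMock


def cleanup_stream_pool(agent, reason):
    """Hypothetical helper: purge the dead connection pool for the active provider."""
    if agent.api_mode == "anthropic_messages":
        # Close the Anthropic client so its transport is dropped, then rebuild it.
        try:
            agent._anthropic_client.close()
        except Exception:
            pass  # best effort; the client may already be closed
        agent._rebuild_anthropic_client()
    else:
        # Other modes keep the existing OpenAI-client replacement path.
        # The `reason` keyword is an assumed parameter, not a confirmed signature.
        agent._replace_primary_openai_client(reason=reason)


# Stand-in agent; the real object lives in agent/chat_completion_helpers.py.
agent = MagicMock()
agent.api_mode = "anthropic_messages"
cleanup_stream_pool(agent, reason="stale_stream")
```

Factoring the branch into one helper is a design choice of this sketch only; the PR describes guarding each of the three call sites inline.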

Open PR #14430 covers only the outer stale-detector site and does not use _rebuild_anthropic_client() (Bedrock not handled). Open PR #23678 covers only the two inner retry sites, leaving the stale-stream hang unaddressed. This PR covers all three sites.

Related Issue

Fixes #28161

Type of Change

  • 🐛 Bug fix

Changes Made

  • agent/chat_completion_helpers.py: guard all three _replace_primary_openai_client() call sites with api_mode != "anthropic_messages"; add _anthropic_client.close() + _rebuild_anthropic_client() branch for Anthropic mode (+18 lines)
  • tests/run_agent/test_28161_anthropic_stream_pool_cleanup.py: two new tests — stream retry cleanup and stale-stream detector cleanup for Anthropic mode

How to Test

pytest tests/run_agent/test_28161_anthropic_stream_pool_cleanup.py -v --override-ini="addopts="

Checklist

  • Contributing Guide read | Conventional Commits | No duplicate PR covering all 3 sites
  • Single logical fix | Tests added | Platform: macOS
  • Docs — N/A | Cross-platform — N/A

… OpenAI client

interruptible_streaming_api_call() has three connection-pool cleanup
sites that called _replace_primary_openai_client() unconditionally.
For api_mode=anthropic_messages this has two consequences:

1. _replace_primary_openai_client() fails (OPENAI_API_KEY unset on
   Anthropic-only configs), so dead connections are never purged.
2. The stale-stream detector's outer-poll site (L1977) is the only
   mechanism that can interrupt the worker thread while it blocks in
   for event in stream:. Because the Anthropic client is never closed,
   the thread stays blocked until the 900 s httpx read-timeout fires,
   producing a visible 15-minute hang for Telegram/gateway users on
   claude-opus-4-7.

Fix: mirror the existing interrupt-path pattern (L1989-1997) at all
three cleanup sites — if api_mode == "anthropic_messages", call
_anthropic_client.close() + _rebuild_anthropic_client() instead of
_replace_primary_openai_client(). _rebuild_anthropic_client() handles
both direct Anthropic and Bedrock-hosted Claude correctly, unlike the
inline build_anthropic_client() calls in open PR NousResearch#14430.

PR NousResearch#14430 (open) covers only the outer stale-detector site (L1977).
PR NousResearch#23678 (open) covers only the inner retry sites (L1774, L1833).
This PR covers all three sites and uses _rebuild_anthropic_client()
for Bedrock parity.

Fixes NousResearch#28161
@alt-glitch added labels on May 18, 2026: type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/agent (Core agent loop, run_agent.py, prompt builder), provider/anthropic (Anthropic native Messages API)
@alt-glitch

Collaborator

Supersedes #14430 (outer stale-detector site only, no Bedrock) and #23678 (two inner retry sites only, stale-stream hang unaddressed). This PR covers all three cleanup sites. Fixes #28161.

Development

Successfully merging this pull request may close these issues.

Anthropic streaming: stale/retry paths call _replace_primary_openai_client, causing 15-min hang on stuck streams
