Anthropic streaming: stale/retry paths call _replace_primary_openai_client, causing 15-min hang on stuck streams #28161

@bautrey

Summary

When `api_mode == "anthropic_messages"`, three streaming-cleanup paths in `agent/chat_completion_helpers.py` unconditionally rebuild the OpenAI primary client via `agent._replace_primary_openai_client(...)`. For Anthropic-native users this goes wrong in two ways:

  1. The OpenAI rebuild fails with `Missing credentials. Please pass an api_key, ... or set the OPENAI_API_KEY environment variable.` because `OPENAI_API_KEY` is unset on Anthropic-only configurations.
  2. The in-flight Anthropic httpx stream is never closed, so the worker thread iterating `messages.stream(...)` keeps blocking on the dead socket until the 900s httpx read timeout fires.

User-visible symptom (running Telegram → claude-opus-4-7): the agent appears to hang ~15 minutes before any retry or fallback engages.

Affected lines (current agent/chat_completion_helpers.py)

  • ~L1775 — reason="stream_mid_tool_retry_pool_cleanup"
  • ~L1833 — reason="stream_retry_pool_cleanup"
  • ~L1977 — reason="stale_stream_pool_cleanup"

All three branches dispatch to OpenAI cleanup regardless of `agent.api_mode`. The `_interrupt_requested` branch in the same function already does the right thing for Anthropic: it calls `agent._anthropic_client.close()` followed by `agent._rebuild_anthropic_client()`. The three cleanup sites just need to mirror that pattern.

Evidence from a real `errors.log` (timestamps redacted)

```
WARNING run_agent: Stream stale for 180s (threshold 180s) — no chunks received. model=claude-opus-4-7 context=~13,191 tokens. Killing connection.
WARNING run_agent: Failed to rebuild shared OpenAI client (stale_stream_pool_cleanup) thread=asyncio_1:... provider=anthropic base_url=https://api.anthropic.com model=claude-opus-4-7 error=Missing credentials. Please pass an `api_key`, ... or set the `OPENAI_API_KEY` or `OPENAI_ADMIN_KEY` environment variable.
```

And the eventual unblock (only when the 900s httpx read-timeout finally fires):

```
WARNING run_agent: Stream drop on attempt 2/3 — retrying. ... provider=anthropic base_url=https://api.anthropic.com error_type=ReadTimeout error=The read operation timed out chain=ReadTimeout(...) <- TimeoutError(...) http_status=200 bytes=0 chunks=0 elapsed=930.56s ttfb=- upstream=[cf-ray=... cf-cache-status=DYNAMIC server=cloudflare]
```

`bytes=0 chunks=0 elapsed=930.56s` is the smoking gun — the connection was held open for ~15 minutes with zero data flow because nothing was closing it.
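To check whether other sessions hit the same stall, a small scan of `errors.log` for this signature can help. The following is a minimal sketch, assuming the `bytes=... chunks=... elapsed=...s` fields always appear in the order shown in the excerpt above; the function name `find_stalled_streams` and the 600-second cutoff are hypothetical choices, not part of the codebase.

```python
import re

# Matches the counters as they appear in the "Stream drop" log line above.
STALL_RE = re.compile(
    r"bytes=(?P<bytes>\d+) chunks=(?P<chunks>\d+) elapsed=(?P<elapsed>[\d.]+)s"
)


def find_stalled_streams(lines, min_elapsed=600.0):
    """Return elapsed seconds for drops that carried no data for a long time.

    min_elapsed is an arbitrary cutoff chosen for illustration.
    """
    hits = []
    for line in lines:
        match = STALL_RE.search(line)
        if match is None:
            continue
        no_data = int(match["bytes"]) == 0 and int(match["chunks"]) == 0
        if no_data and float(match["elapsed"]) >= min_elapsed:
            hits.append(float(match["elapsed"]))
    return hits


# Example input, abbreviated from the log excerpt above.
sample = (
    "WARNING run_agent: Stream drop on attempt 2/3 ... "
    "http_status=200 bytes=0 chunks=0 elapsed=930.56s ttfb=-"
)
stalls = find_stalled_streams([sample])
```

Running it over the real file would be along the lines of `find_stalled_streams(open("errors.log"))`, since the function only needs an iterable of lines.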

Regression test currently encodes the bug

`tests/run_agent/test_streaming.py::test_anthropic_stream_parser_valueerror_retries_before_delivery` (and possibly siblings) currently asserts `mock_replace.call_count == 1` for the Anthropic path — i.e. the test passes precisely because the buggy OpenAI rebuild is invoked. This means the upstream test suite is green while the bug is live in production. Worth re-pointing this test at the Anthropic close+rebuild path as part of the fix.
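As a rough sketch of what the re-pointed assertions could look like: the helper name `assert_anthropic_cleanup` is hypothetical, and the `MagicMock` objects below stand in for the real agent fixture and patch target, whose setup is not shown in this issue.

```python
from unittest.mock import MagicMock


def assert_anthropic_cleanup(agent, mock_replace):
    """Hypothetical assertion helper for the Anthropic streaming-retry path."""
    # The OpenAI rebuild must not run for Anthropic-native sessions.
    assert mock_replace.call_count == 0
    # In a single-retry scenario, expect one close and one rebuild.
    assert agent._anthropic_client.close.call_count == 1
    assert agent._rebuild_anthropic_client.call_count == 1


# Stand-in objects; the real test would use its existing fixtures.
agent = MagicMock()
mock_replace = MagicMock()

# Simulate what the fixed cleanup path is expected to do.
agent._anthropic_client.close()
agent._rebuild_anthropic_client()

assert_anthropic_cleanup(agent, mock_replace)
```

The key change from the current test is the first assertion: `mock_replace.call_count` goes from `1` to `0` on the Anthropic path.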

Proposed fix (drop-in)

At each of the three sites in `agent/chat_completion_helpers.py`, branch on `api_mode`:

```python
if agent.api_mode == "anthropic_messages":
    # Close the in-flight Anthropic client, then rebuild it.
    try:
        agent._anthropic_client.close()
    except Exception:
        pass  # best-effort cleanup
    try:
        agent._rebuild_anthropic_client()
    except Exception:
        pass  # best-effort cleanup
else:
    agent._replace_primary_openai_client(
        reason="stale_stream_pool_cleanup"  # or stream_retry_pool_cleanup / stream_mid_tool_retry_pool_cleanup
    )
```

This mirrors the existing `_interrupt_requested` branch verbatim.
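Since the same branch would be repeated at three sites, one option is to factor it into a single helper and pass the `reason` through. This is only a sketch: the name `cleanup_stream_client` is hypothetical, and logging the swallowed exceptions (rather than a bare `pass`) is a suggestion, not what the existing `_interrupt_requested` branch does. The `MagicMock` agent is a stand-in so the snippet runs on its own.

```python
import logging
from unittest.mock import MagicMock

logger = logging.getLogger(__name__)


def cleanup_stream_client(agent, reason: str) -> None:
    """Hypothetical helper: reset whichever client owns the stuck stream."""
    if agent.api_mode == "anthropic_messages":
        # Close before rebuilding so the old connection is not left open.
        try:
            agent._anthropic_client.close()
        except Exception as exc:
            logger.debug("Anthropic client close failed (%s): %s", reason, exc)
        try:
            agent._rebuild_anthropic_client()
        except Exception as exc:
            logger.warning("Anthropic client rebuild failed (%s): %s", reason, exc)
    else:
        # Non-Anthropic modes keep the existing OpenAI cleanup.
        agent._replace_primary_openai_client(reason=reason)


# Stand-in agent for illustration only.
agent = MagicMock()
agent.api_mode = "anthropic_messages"
cleanup_stream_client(agent, "stale_stream_pool_cleanup")
```

Each of the three call sites would then reduce to one call with its own `reason` string, which keeps the Anthropic and OpenAI paths from drifting apart again.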

Local verification

Applied the fix at all three sites + corrected the regression test; full streaming-related test suite (72 tests across `test_anthropic_error_handling`, `test_interrupt_propagation`, `test_openai_client_lifecycle`, `test_streaming`, `test_stream_drop_logging`, `test_stream_interrupt_retry`) passes. Telegram + `claude-opus-4-7` agent no longer hangs after the fix.

Happy to open a PR if useful — let me know.

Labels

  • P1 (High): major feature broken, no workaround
  • comp/agent: core agent loop, run_agent.py, prompt builder
  • provider/anthropic: Anthropic native Messages API
  • type/bug: something isn't working