Summary
When `api_mode == "anthropic_messages"`, three streaming-cleanup paths in `agent/chat_completion_helpers.py` unconditionally rebuild the OpenAI primary client via `agent._replace_primary_openai_client(...)`. For Anthropic-native users this is wrong on two counts:
- The OpenAI rebuild fails with `Missing credentials. Please pass an api_key, ... or set the OPENAI_API_KEY environment variable.` because `OPENAI_API_KEY` is unset on Anthropic-only configurations.
- The in-flight Anthropic httpx stream is never closed, so the worker thread iterating `messages.stream(...)` keeps blocking on the dead socket until the 900s httpx read-timeout fires.
User-visible symptom (running Telegram → `claude-opus-4-7`): the agent appears to hang for ~15 minutes before any retry or fallback engages.
Affected lines (current `agent/chat_completion_helpers.py`)
- ~L1775: `reason="stream_mid_tool_retry_pool_cleanup"`
- ~L1833: `reason="stream_retry_pool_cleanup"`
- ~L1977: `reason="stale_stream_pool_cleanup"`
All three branches dispatch to OpenAI cleanup regardless of `agent.api_mode`. The `_interrupt_requested` branch in the same function already does the right thing for Anthropic: it calls `agent._anthropic_client.close()` followed by `agent._rebuild_anthropic_client()`. The three cleanup sites just need to mirror that pattern.
Evidence from a real `errors.log` (timestamps redacted)
```
WARNING run_agent: Stream stale for 180s (threshold 180s) — no chunks received. model=claude-opus-4-7 context=~13,191 tokens. Killing connection.
WARNING run_agent: Failed to rebuild shared OpenAI client (stale_stream_pool_cleanup) thread=asyncio_1:... provider=anthropic base_url=https://api.anthropic.com model=claude-opus-4-7 error=Missing credentials. Please pass an `api_key`, ... or set the `OPENAI_API_KEY` or `OPENAI_ADMIN_KEY` environment variable.
```
And the eventual unblock (only when the 900s httpx read-timeout finally fires):
```
WARNING run_agent: Stream drop on attempt 2/3 — retrying. ... provider=anthropic base_url=https://api.anthropic.com error_type=ReadTimeout error=The read operation timed out chain=ReadTimeout(...) <- TimeoutError(...) http_status=200 bytes=0 chunks=0 elapsed=930.56s ttfb=- upstream=[cf-ray=... cf-cache-status=DYNAMIC server=cloudflare]
```
`bytes=0 chunks=0 elapsed=930.56s` is the smoking gun — the connection was held open for ~15 minutes with zero data flow because nothing was closing it.
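To check whether other sessions hit the same signature, a scan of `errors.log` along these lines may help. This is only a sketch: the helper name `is_silent_hang` and the 600-second threshold are assumptions made for illustration, and the sample line is abridged from the log excerpt above.

```python
import re

# key=value pairs whose value contains no spaces; this is enough to pick
# out bytes=, chunks= and elapsed= from a "Stream drop" warning.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def is_silent_hang(line, min_elapsed_s=600.0):
    """Return True if a 'Stream drop' line shows zero data over a long wait.

    min_elapsed_s is an arbitrary illustrative threshold, not a project setting.
    """
    if "Stream drop" not in line:
        return False
    fields = dict(FIELD_RE.findall(line))
    try:
        elapsed = float(fields["elapsed"].rstrip("s"))
        no_data = int(fields["bytes"]) == 0 and int(fields["chunks"]) == 0
    except (KeyError, ValueError):
        # Line does not carry the fields in the expected shape.
        return False
    return no_data and elapsed >= min_elapsed_s


# Abridged copy of the warning quoted in this report.
sample = (
    "WARNING run_agent: Stream drop on attempt 2/3 ... provider=anthropic "
    "error_type=ReadTimeout http_status=200 bytes=0 chunks=0 "
    "elapsed=930.56s ttfb=-"
)
print(is_silent_hang(sample))
```

Running it over each line of the log and counting the hits would give a rough idea of how often the hang occurs.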
Regression test currently encodes the bug
`tests/run_agent/test_streaming.py::test_anthropic_stream_parser_valueerror_retries_before_delivery` (and possibly siblings) currently asserts `mock_replace.call_count == 1` for the Anthropic path — i.e. the test passes precisely because the buggy OpenAI rebuild is invoked. This means the upstream test suite is green while the bug is live in production. Worth re-pointing this test at the Anthropic close+rebuild path as part of the fix.
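The shape of the re-pointed assertions could look roughly like the sketch below. It does not exercise the real agent code: `make_agent` and `assert_anthropic_cleanup` are hypothetical names, and the two "stand-in" sections simply replay the calls each behavior would make on a `MagicMock`, to show that the new assertions reject the current behavior and accept the intended one.

```python
from unittest.mock import MagicMock


def make_agent():
    """Mock agent configured for the Anthropic-native path (hypothetical)."""
    agent = MagicMock()
    agent.api_mode = "anthropic_messages"
    return agent


def assert_anthropic_cleanup(agent):
    """Assertions the re-pointed test could make after one stream retry."""
    agent._anthropic_client.close.assert_called_once()
    agent._rebuild_anthropic_client.assert_called_once()
    agent._replace_primary_openai_client.assert_not_called()


# Stand-in for the current behavior: OpenAI rebuild regardless of api_mode.
buggy = make_agent()
buggy._replace_primary_openai_client(reason="stream_retry_pool_cleanup")

# Stand-in for the intended behavior: close, then rebuild, the Anthropic client.
fixed = make_agent()
fixed._anthropic_client.close()
fixed._rebuild_anthropic_client()

# The intended behavior satisfies the new assertions.
assert_anthropic_cleanup(fixed)

# The current behavior does not, even though call_count == 1 still holds.
try:
    assert_anthropic_cleanup(buggy)
    buggy_passes = True
except AssertionError:
    buggy_passes = False
```

In the real test, the mocks would be wired into the existing fixture and the assertions run after the retry has been triggered.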
Proposed fix (drop-in)
At each of the three sites in `agent/chat_completion_helpers.py`, branch on `api_mode`:
```python
if agent.api_mode == "anthropic_messages":
    # Close the in-flight client first so the stalled stream is torn down.
    try:
        agent._anthropic_client.close()
    except Exception:
        pass
    # Then build a fresh client for the retry.
    try:
        agent._rebuild_anthropic_client()
    except Exception:
        pass
else:
    agent._replace_primary_openai_client(
        reason="stale_stream_pool_cleanup"  # or stream_retry_pool_cleanup / stream_mid_tool_retry_pool_cleanup
    )
```
This mirrors the existing `_interrupt_requested` branch verbatim.
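Since the same branch would be repeated at three sites, one option is to factor it into a small helper. The sketch below is an assumption about how that might look, not code from the repository: `cleanup_stream_client` is a hypothetical name, the returned string exists only to make the chosen path easy to check, and the agent is a `MagicMock` stand-in. It logs failures instead of silently passing, which is a deliberate departure from the snippet above.

```python
import logging
from unittest.mock import MagicMock

logger = logging.getLogger("run_agent")


def cleanup_stream_client(agent, reason):
    """Hypothetical helper: run the cleanup that matches agent.api_mode."""
    if agent.api_mode == "anthropic_messages":
        # Close first; per this report, that is what tears down the
        # in-flight Anthropic stream.
        try:
            agent._anthropic_client.close()
        except Exception as exc:
            logger.debug("Anthropic close failed (%s): %s", reason, exc)
        # Rebuild so the next attempt starts from a fresh client.
        try:
            agent._rebuild_anthropic_client()
        except Exception as exc:
            logger.warning("Anthropic rebuild failed (%s): %s", reason, exc)
        return "anthropic"
    # Any other api_mode keeps the existing OpenAI cleanup.
    agent._replace_primary_openai_client(reason=reason)
    return "openai"


# Usage with a mock agent on the Anthropic-native path.
agent = MagicMock()
agent.api_mode = "anthropic_messages"
path = cleanup_stream_client(agent, "stale_stream_pool_cleanup")
```

Each of the three sites would then reduce to a single call that passes its own `reason` string.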
Local verification
I applied the fix at all three sites and corrected the regression test. The full streaming-related test suite (72 tests across `test_anthropic_error_handling`, `test_interrupt_propagation`, `test_openai_client_lifecycle`, `test_streaming`, `test_stream_drop_logging`, `test_stream_interrupt_retry`) passes, and the Telegram + `claude-opus-4-7` agent no longer hangs.
Happy to open a PR if useful — let me know.