Anthropic streaming: stale/retry paths call _replace_primary_openai_client, causing 15-min hang on stuck streams #28161

## Summary

When `api_mode == "anthropic_messages"`, three streaming-cleanup paths in `agent/chat_completion_helpers.py` unconditionally rebuild the OpenAI primary client via `agent._replace_primary_openai_client(...)`. For Anthropic-native users this is wrong on two counts:

1. The OpenAI rebuild **fails** with `Missing credentials. Please pass an api_key, ... or set the OPENAI_API_KEY environment variable.` because `OPENAI_API_KEY` is unset on Anthropic-only configurations.
2. The in-flight Anthropic httpx stream is **never closed**, so the worker thread iterating `messages.stream(...)` keeps blocking on the dead socket until the 900s httpx read-timeout fires.

User-visible symptom (running Telegram → `claude-opus-4-7`): the agent appears to hang ~15 minutes before any retry or fallback engages.

## Affected lines (current `agent/chat_completion_helpers.py`)

- ~L1775 — `reason="stream_mid_tool_retry_pool_cleanup"`
- ~L1833 — `reason="stream_retry_pool_cleanup"`
- ~L1977 — `reason="stale_stream_pool_cleanup"`

All three branches dispatch to OpenAI cleanup regardless of `agent.api_mode`. The `_interrupt_requested` branch in the same function already does the right thing for Anthropic — it calls `agent._anthropic_client.close()` followed by `agent._rebuild_anthropic_client()`. The three cleanup sites just need to mirror that pattern.

## Evidence from a real `errors.log` (timestamps redacted)

```
WARNING run_agent: Stream stale for 180s (threshold 180s) — no chunks received. model=claude-opus-4-7 context=~13,191 tokens. Killing connection.
WARNING run_agent: Failed to rebuild shared OpenAI client (stale_stream_pool_cleanup) thread=asyncio_1:... provider=anthropic base_url=https://api.anthropic.com model=claude-opus-4-7 error=Missing credentials. Please pass an `api_key`, ... or set the `OPENAI_API_KEY` or `OPENAI_ADMIN_KEY` environment variable.
```

And the eventual unblock (only when the 900s httpx read-timeout finally fires):

```
WARNING run_agent: Stream drop on attempt 2/3 — retrying. ... provider=anthropic base_url=https://api.anthropic.com error_type=ReadTimeout error=The read operation timed out chain=ReadTimeout(...) <- TimeoutError(...) http_status=200 bytes=0 chunks=0 elapsed=930.56s ttfb=- upstream=[cf-ray=... cf-cache-status=DYNAMIC server=cloudflare]
```

`bytes=0 chunks=0 elapsed=930.56s` is the smoking gun: the connection was held open for ~15 minutes with zero data flow because nothing closed it.

## Regression test currently encodes the bug

`tests/run_agent/test_streaming.py::test_anthropic_stream_parser_valueerror_retries_before_delivery` (and possibly siblings) currently asserts `mock_replace.call_count == 1` for the Anthropic path; that is, the test passes precisely because the buggy OpenAI rebuild is invoked. As a result, the upstream test suite is green while the bug is live in production. This test should be re-pointed at the Anthropic close+rebuild path as part of the fix.
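A re-pointed assertion might look roughly like the sketch below. This is not the repository's test: `cleanup_stream_client` is a hypothetical stand-in for the cleanup step in `agent/chat_completion_helpers.py`, and the agent is a bare `MagicMock`. Only the attribute names `api_mode`, `_anthropic_client`, `_rebuild_anthropic_client`, and `_replace_primary_openai_client` come from this report.

```python
from unittest.mock import MagicMock


def cleanup_stream_client(agent, reason):
    """Hypothetical stand-in for the cleanup step under test."""
    if agent.api_mode == "anthropic_messages":
        agent._anthropic_client.close()
        agent._rebuild_anthropic_client()
    else:
        agent._replace_primary_openai_client(reason=reason)


def check_anthropic_cleanup():
    agent = MagicMock()
    agent.api_mode = "anthropic_messages"
    # Keep a handle on the client mock so the assertion still targets
    # the original client even if a real rebuild would swap it out.
    client = agent._anthropic_client

    cleanup_stream_client(agent, reason="stream_retry_pool_cleanup")

    # The OpenAI rebuild must not run on the Anthropic path...
    agent._replace_primary_openai_client.assert_not_called()
    # ...and the Anthropic client must be closed once and rebuilt once.
    client.close.assert_called_once_with()
    agent._rebuild_anthropic_client.assert_called_once_with()
    return True
```

The key change from the current test is the direction of the assertion: the OpenAI rebuild count should be zero on the Anthropic path rather than one.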

## Proposed fix (drop-in)

At each of the three sites in `agent/chat_completion_helpers.py`, branch on `api_mode`:

```python
if agent.api_mode == "anthropic_messages":
    # Tear down the current Anthropic client so the stuck stream is closed.
    try:
        agent._anthropic_client.close()
    except Exception:
        pass  # best-effort: the connection may already be dead
    # Build a fresh client for the next attempt.
    try:
        agent._rebuild_anthropic_client()
    except Exception:
        pass
else:
    agent._replace_primary_openai_client(
        reason="stale_stream_pool_cleanup"  # or stream_retry_pool_cleanup / stream_mid_tool_retry_pool_cleanup
    )
```

This mirrors the existing `_interrupt_requested` branch verbatim.
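Since the same branch would be repeated at three sites, one option is to factor it into a single helper that takes the `reason` string. The sketch below is only an illustration under assumptions: `reset_stream_client`, `FakeAgent`, and `_FakeClient` are hypothetical names, and logging the swallowed exceptions at debug level (instead of a bare `pass`) is a suggestion, not what the existing `_interrupt_requested` branch does.

```python
import logging

logger = logging.getLogger(__name__)


def reset_stream_client(agent, reason):
    """Hypothetical helper: reset the client that matches agent.api_mode.

    Returns the name of the branch taken, which is handy in tests.
    """
    if agent.api_mode == "anthropic_messages":
        try:
            agent._anthropic_client.close()
        except Exception:
            # Best-effort close; keep going so the rebuild still runs.
            logger.debug("Anthropic close failed (%s)", reason, exc_info=True)
        try:
            agent._rebuild_anthropic_client()
        except Exception:
            logger.debug("Anthropic rebuild failed (%s)", reason, exc_info=True)
        return "anthropic"
    agent._replace_primary_openai_client(reason=reason)
    return "openai"


class _FakeClient:
    """Minimal fake client used only to exercise the helper."""

    def __init__(self, fail_on_close=False):
        self.closed = False
        self.fail_on_close = fail_on_close

    def close(self):
        if self.fail_on_close:
            raise RuntimeError("socket already gone")
        self.closed = True


class FakeAgent:
    """Minimal fake agent exposing only the attributes named in this report."""

    def __init__(self, api_mode, fail_on_close=False):
        self.api_mode = api_mode
        self._anthropic_client = _FakeClient(fail_on_close)
        self.calls = []

    def _rebuild_anthropic_client(self):
        self.calls.append("rebuild_anthropic")
        self._anthropic_client = _FakeClient()

    def _replace_primary_openai_client(self, reason):
        self.calls.append(f"replace_openai:{reason}")
```

Each of the three sites would then reduce to a single call such as `reset_stream_client(agent, "stale_stream_pool_cleanup")`, with only the `reason` string differing.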

## Local verification

I applied the fix at all three sites and corrected the regression test; the full streaming-related test suite (72 tests across `test_anthropic_error_handling`, `test_interrupt_propagation`, `test_openai_client_lifecycle`, `test_streaming`, `test_stream_drop_logging`, `test_stream_interrupt_retry`) passes. The Telegram + `claude-opus-4-7` agent no longer hangs after the fix.

Happy to open a PR if useful — let me know.
