fix(error-classifier): route 429 'overloaded' messages to FailoverReason.overloaded (#15297)#15414
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes misclassification of certain HTTP 429 responses that actually indicate provider-wide overload (not per-credential throttling), so the agent rotates credentials immediately instead of wasting the single “retry same credential” slot.
Changes:
- Add overload phrase detection for HTTP 429s in
agent/error_classifier.py, routing matched messages toFailoverReason.overloaded(otherwise keeprate_limit). - Add a
FailoverReason.overloadedhandling branch inrun_agent.py::_recover_with_credential_poolto rotate immediately (no retry-first). - Add regression tests covering 429 overload vs rate-limit disambiguation and credential-pool rotation semantics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
agent/error_classifier.py |
Adds _OVERLOADED_PATTERNS and message-based disambiguation for HTTP 429 overload vs rate limiting. |
run_agent.py |
Implements immediate credential rotation when FailoverReason.overloaded is classified. |
tests/agent/test_error_classifier.py |
Adds classifier tests for overload-language 429s and guards to keep normal 429s as rate_limit. |
tests/agent/test_credential_pool_routing.py |
Adds routing tests asserting overloaded rotates immediately while rate_limit retains retry-first behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.info( | ||
| "Credential %s (provider overloaded) — rotated to pool entry %s", | ||
| rotate_status, | ||
| getattr(next_entry, "id", "?"), | ||
| ) |
There was a problem hiding this comment.
The log message says "Credential %s" but the value being logged is rotate_status (an HTTP status code). This makes the log misleading during incident debugging; consider either logging the current credential ID (if available) or changing the format string to refer to the status code instead. (Same issue exists in the nearby billing/rate_limit branches.)
|
Thanks @copilot — fair catch. Pushed You're right that So a real log line now reads: Which parses correctly as "got HTTP 429, classified as provider overloaded, rotated to credential cred-2". No behaviour change; 145/145 tests still pass. |
|
Re @alt-glitch's cross-reference to #14055 — positioning vs that narrower PR:
Why the handler matters: Happy to defer to #14055 if maintainers want the narrower change first and treat the handler as a separate follow-up — just speak up. Otherwise this PR ships both halves of the fix in one reviewable diff. |
…son.overloaded (NousResearch#15297) Some providers — notably Z.AI / Zhipu — return HTTP 429 with messages like "The service may be temporarily overloaded, please try again later" to signal **server-wide overload**, not per-credential rate limiting. The two conditions need different recovery strategies: | Reason | Correct strategy | |---|---| | ``rate_limit`` | Retry same credential once (the limit may have reset), then rotate | | ``overloaded`` | Skip retry, rotate immediately (the whole endpoint is saturated) | Before this fix: - ``error_classifier.py`` mapped every 429 to ``FailoverReason.rate_limit`` regardless of message body. - ``FailoverReason.overloaded`` already existed as an enum value (and was produced for 503/529) but no production path emitted it for 429. - ``_recover_with_credential_pool`` had no handler for ``overloaded`` — an ``overloaded`` classification fell through to the default no-op ``return False, has_retried_429`` line. Net effect: every overload-language 429 burned a ``has_retried_429`` slot on the same saturated credential before the rotation happened, and cron jobs (one turn each) often used their entire execution on that wasted retry. Two narrow, additive changes: 1. ``agent/error_classifier.py`` — new ``_OVERLOADED_PATTERNS`` list containing provider-language overload phrases (``"temporarily overloaded"``, ``"server is overloaded"``, ``"at capacity"``, …). The 429 branch now checks the error body against this list and emits ``FailoverReason.overloaded`` when matched, preserving ``rate_limit`` for everything else. Kept phrases narrow and provider-language-flavoured so a normal rate-limit message ("you have been rate-limited") doesn't hit this bucket. 2. ``run_agent.py::_recover_with_credential_pool`` — new branch for ``effective_reason == FailoverReason.overloaded``. Same shape as the existing ``billing`` handler: rotate immediately via ``mark_exhausted_and_rotate(...)``, no retry-on-same-credential first. The 503/529 ``overloaded`` classifications produced by the existing code now also flow through this branch. ``tests/agent/test_error_classifier.py`` — 7 cases: - The exact NousResearch#15297 Z.AI message → ``overloaded`` - 9-phrase parametrised matrix (server/service/upstream overloaded, at/over capacity, "currently overloaded", …) → all ``overloaded`` - Plain "Rate limit reached for requests per minute" 429 → still ``rate_limit`` (regression guard against the disambiguation silently broadening) - Plain "Too Many Requests" 429 → still ``rate_limit`` - 503 path unchanged → still ``overloaded`` (negative control) ``tests/agent/test_credential_pool_routing.py`` — ``TestPoolOverloadedRotation`` (4 cases): - Overloaded on first failure → rotates immediately (the fix) - Overloaded with ``has_retried_429=True`` → still rotates (no retry- flag dependence — the rate-limit gate is specific to ``rate_limit``) - Single-entry pool overload → ``recovered=False`` so outer fallback takes over rather than spinning - ``rate_limit`` on first failure → still uses retry-first path (regression guard against the new branch broadening) **Verified regression guards**: temporarily reverted the classifier's overload branch; 10 of the 11 new classifier tests correctly failed with clear assertion messages. Restored fix → all 145 tests pass (``test_error_classifier.py`` + ``test_credential_pool_routing.py``, existing + new). Closes NousResearch#15297 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntial %s" (Copilot NousResearch#15414) Copilot's review on NousResearch#15414 flagged that the rotation log lines read ``"Credential %s (billing) — rotated to pool entry %s"`` where the first ``%s`` is bound to ``rotate_status`` (an HTTP status code, e.g. 429), not a credential ID. Reading the literal log line — ``"Credential 429 (billing) — rotated to pool entry cred-2"`` — would suggest "credential 429" was rotated, which is meaningless and actively confusing during incident debugging. Same issue existed on the three sibling branches (billing, rate_limit, auth refresh-failed) plus the new overloaded branch this PR adds. Drive-by fix all four with the same ``HTTP %s (reason) — rotated credential to pool entry %s`` shape so: HTTP 429 (provider overloaded) — rotated credential to pool entry cred-2 reads correctly as "we got HTTP 429, classified it as provider overloaded, rotated to credential cred-2". No behaviour change. 145/145 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b0d5d63 to
0698e90
Compare
|
Rebased onto current Re-ran the focused suite under pytest-xdist after the rebase:
Force-pushed with |
|
Closing to keep the queue clean — happy to reopen if this is still useful. |
What does this PR do?
Fixes `#15297`. Some providers — notably Z.AI / Zhipu — return HTTP 429 with messages like `"The service may be temporarily overloaded, please try again later"` to signal server-wide overload, not per-credential rate limiting.
The two conditions need different recovery strategies:
Before this fix:
Net effect: every overload-language 429 burned a `has_retried_429` slot on the same saturated credential before the rotation happened. Cron jobs (one turn each) often used their entire execution on that wasted retry.
Fix — two narrow, additive changes
1. `agent/error_classifier.py` — disambiguate 429s on message body
New `_OVERLOADED_PATTERNS` list of provider-language overload phrases:
```python
_OVERLOADED_PATTERNS = [
"temporarily overloaded",
"server is overloaded",
"server overloaded",
"service overloaded",
"service is overloaded",
"service may be temporarily overloaded",
"upstream overloaded",
"is overloaded, please try again",
"at capacity",
"over capacity",
"currently overloaded",
]
```
The 429 branch now checks the error body and emits `FailoverReason.overloaded` when matched, preserving `rate_limit` for everything else. Phrases are deliberately narrow and provider-language-flavoured so normal rate-limit messages (`"you have been rate-limited"`, `"Too Many Requests"`) don't hit this bucket.
2. `run_agent.py::_recover_with_credential_pool` — handle `overloaded`
New branch with the same shape as the existing `billing` handler: rotate immediately via `mark_exhausted_and_rotate(...)`, no retry-on-same-credential first. The 503/529 `overloaded` classifications produced by the existing code now also flow through this branch as a side benefit.
Related Issue
Fixes #15297
Type of Change
Test plan
Test coverage detail
Classifier (`tests/agent/test_error_classifier.py` — 7 new):
Pool handler (`tests/agent/test_credential_pool_routing.py` — 4 new):
Not in scope