Skip to content

fix(error-classifier): route 429 'overloaded' messages to FailoverReason.overloaded (#15297)#15414

Closed
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/error-classifier-429-overloaded
Closed

fix(error-classifier): route 429 'overloaded' messages to FailoverReason.overloaded (#15297)#15414
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/error-classifier-429-overloaded

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes `#15297`. Some providers — notably Z.AI / Zhipu — return HTTP 429 with messages like `"The service may be temporarily overloaded, please try again later"` to signal server-wide overload, not per-credential rate limiting.

The two conditions need different recovery strategies:

Reason Correct strategy
`rate_limit` Retry same credential once (the limit may have reset), then rotate
`overloaded` Skip retry, rotate immediately (the whole endpoint is saturated)

Before this fix:

  • `error_classifier.py` mapped every 429 to `FailoverReason.rate_limit` regardless of message body.
  • `FailoverReason.overloaded` already existed as an enum value (and was produced for 503/529) but no production path emitted it for 429.
  • `_recover_with_credential_pool` had no handler for `overloaded` — an `overloaded` classification fell through to the default no-op `return False, has_retried_429` line.

Net effect: every overload-language 429 burned a `has_retried_429` slot on the same saturated credential before the rotation happened. Cron jobs (one turn each) often used their entire execution on that wasted retry.

Fix — two narrow, additive changes

1. `agent/error_classifier.py` — disambiguate 429s on message body

New `_OVERLOADED_PATTERNS` list of provider-language overload phrases:

```python
_OVERLOADED_PATTERNS = [
"temporarily overloaded",
"server is overloaded",
"server overloaded",
"service overloaded",
"service is overloaded",
"service may be temporarily overloaded",
"upstream overloaded",
"is overloaded, please try again",
"at capacity",
"over capacity",
"currently overloaded",
]
```

The 429 branch now checks the error body and emits `FailoverReason.overloaded` when matched, preserving `rate_limit` for everything else. Phrases are deliberately narrow and provider-language-flavoured so normal rate-limit messages (`"you have been rate-limited"`, `"Too Many Requests"`) don't hit this bucket.

2. `run_agent.py::_recover_with_credential_pool` — handle `overloaded`

New branch with the same shape as the existing `billing` handler: rotate immediately via `mark_exhausted_and_rotate(...)`, no retry-on-same-credential first. The 503/529 `overloaded` classifications produced by the existing code now also flow through this branch as a side benefit.

Related Issue

Fixes #15297

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✅ Tests (adding or improving test coverage)

Test plan

  • 15 new tests in 2 files — all green on py3.11 venv (145 total in the affected test files)
  • Regression guards verified: temporarily reverted the classifier's overload branch; 10 of the 11 new classifier tests correctly failed with clear assertion messages pointing at the regressed invariant. Restored fix → all 145 pass.
  • Pre-existing tests still green:
    • `test_error_classifier.py` — 102/102 (was 95, +7 new)
    • `test_credential_pool_routing.py` — 13/13 (was 9, +4 new)

Test coverage detail

Classifier (`tests/agent/test_error_classifier.py` — 7 new):

  • `test_429_zai_temporarily_overloaded` — exact Error classifier: 429 'temporarily overloaded' misclassified as rate_limit — triggers wrong recovery path #15297 Z.AI message
  • `test_429_overload_phrases_route_to_overloaded` — 9-phrase parametrised matrix (server/service/upstream overloaded, at/over capacity, `currently overloaded`, `is overloaded, please try again`, …)
  • `test_429_plain_rate_limit_remains_rate_limit` — "Rate limit reached for requests per minute" 429 → still `rate_limit` (regression guard against disambiguation silently broadening)
  • `test_429_too_many_requests_remains_rate_limit` — literal HTTP 429 reason phrase → still `rate_limit`
  • `test_503_overloaded_path_unchanged` — sanity check that the existing 503 → `overloaded` path didn't regress

Pool handler (`tests/agent/test_credential_pool_routing.py` — 4 new):

  • `test_overloaded_rotates_immediately_on_first_failure` — the fix
  • `test_overloaded_does_not_block_on_has_retried_flag` — overloaded rotates regardless of retry-flag state
  • `test_overloaded_pool_exhaustion_returns_false` — single-entry pool returns `recovered=False` so outer fallback takes over
  • `test_rate_limit_still_uses_retry_first` — negative control: rate_limit must KEEP its retry-first semantics, otherwise normal per-key throttles would double-rotate and burn the pool

Not in scope

Copilot AI review requested due to automatic review settings April 24, 2026 23:40
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 24, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes misclassification of certain HTTP 429 responses that actually indicate provider-wide overload (not per-credential throttling), so the agent rotates credentials immediately instead of wasting the single “retry same credential” slot.

Changes:

  • Add overload phrase detection for HTTP 429s in agent/error_classifier.py, routing matched messages to FailoverReason.overloaded (otherwise keep rate_limit).
  • Add a FailoverReason.overloaded handling branch in run_agent.py::_recover_with_credential_pool to rotate immediately (no retry-first).
  • Add regression tests covering 429 overload vs rate-limit disambiguation and credential-pool rotation semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
agent/error_classifier.py Adds _OVERLOADED_PATTERNS and message-based disambiguation for HTTP 429 overload vs rate limiting.
run_agent.py Implements immediate credential rotation when FailoverReason.overloaded is classified.
tests/agent/test_error_classifier.py Adds classifier tests for overload-language 429s and guards to keep normal 429s as rate_limit.
tests/agent/test_credential_pool_routing.py Adds routing tests asserting overloaded rotates immediately while rate_limit retains retry-first behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread run_agent.py
Comment on lines +5600 to +5604
logger.info(
"Credential %s (provider overloaded) — rotated to pool entry %s",
rotate_status,
getattr(next_entry, "id", "?"),
)

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message says "Credential %s" but the value being logged is rotate_status (an HTTP status code). This makes the log misleading during incident debugging; consider either logging the current credential ID (if available) or changing the format string to refer to the status code instead. (Same issue exists in the nearby billing/rate_limit branches.)

Copilot uses AI. Check for mistakes.
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #14055 and #14038 — same root cause: error_classifier.py classifies 429 overloaded responses as rate_limit. This PR is a broader variant that also touches run_agent.py recovery path. #14055 is the more narrowly scoped fix for the same classifier bug.

@briandevans

Copy link
Copy Markdown
Contributor Author

Thanks @copilot — fair catch. Pushed b0d5d63d:

You're right that "Credential %s (billing) — rotated to pool entry %s" reads as if "credential 429" was the thing rotated, which is meaningless and actively confusing in an incident timeline. Drive-by fixed all four branches (billing, rate_limit, the new overloaded, and auth-refresh-failed) with the same shape:

HTTP %s (reason) — rotated credential to pool entry %s

So a real log line now reads:

HTTP 429 (provider overloaded) — rotated credential to pool entry cred-2

Which parses correctly as "got HTTP 429, classified as provider overloaded, rotated to credential cred-2". No behaviour change; 145/145 tests still pass.

@briandevans

Copy link
Copy Markdown
Contributor Author

Re @alt-glitch's cross-reference to #14055 — positioning vs that narrower PR:

This PR (#15414) #14055 (ms-alan)
Files agent/error_classifier.py + run_agent.py agent/error_classifier.py only
Classifier 429 + overload-language → overloaded Same
Pool handler New overloaded branch in _recover_with_credential_pool (rotates immediately, like billing) None — overloaded falls through to default no-op
Tests 15 new (classifier matrix + pool handler matrix + regression guards) 0

Why the handler matters: FailoverReason.overloaded already exists as an enum (and is produced for 503/529), but _recover_with_credential_pool had no handler for it — an overloaded classification reached the default return False, has_retried_429 line and effectively did nothing. So the classifier-only fix would mark these errors as overloaded but the recovery path still wouldn't rotate immediately on the first failure (it would still wait for has_retried_429 if no handler matched, and even after that fall through to fallback rather than pool rotation). I confirmed this by reading the code at run_agent.py line 5563-5628 before writing the handler.

Happy to defer to #14055 if maintainers want the narrower change first and treat the handler as a separate follow-up — just speak up. Otherwise this PR ships both halves of the fix in one reviewable diff.

briandevans and others added 2 commits April 29, 2026 07:11
…son.overloaded (NousResearch#15297)

Some providers — notably Z.AI / Zhipu — return HTTP 429 with messages
like "The service may be temporarily overloaded, please try again later"
to signal **server-wide overload**, not per-credential rate limiting.
The two conditions need different recovery strategies:

| Reason | Correct strategy |
|---|---|
| ``rate_limit`` | Retry same credential once (the limit may have reset), then rotate |
| ``overloaded`` | Skip retry, rotate immediately (the whole endpoint is saturated) |

Before this fix:

- ``error_classifier.py`` mapped every 429 to ``FailoverReason.rate_limit``
  regardless of message body.
- ``FailoverReason.overloaded`` already existed as an enum value (and was
  produced for 503/529) but no production path emitted it for 429.
- ``_recover_with_credential_pool`` had no handler for ``overloaded`` —
  an ``overloaded`` classification fell through to the default no-op
  ``return False, has_retried_429`` line.

Net effect: every overload-language 429 burned a ``has_retried_429``
slot on the same saturated credential before the rotation happened, and
cron jobs (one turn each) often used their entire execution on that
wasted retry.

Two narrow, additive changes:

1. ``agent/error_classifier.py`` — new ``_OVERLOADED_PATTERNS`` list
   containing provider-language overload phrases (``"temporarily
   overloaded"``, ``"server is overloaded"``, ``"at capacity"``, …).
   The 429 branch now checks the error body against this list and
   emits ``FailoverReason.overloaded`` when matched, preserving
   ``rate_limit`` for everything else.  Kept phrases narrow and
   provider-language-flavoured so a normal rate-limit message
   ("you have been rate-limited") doesn't hit this bucket.

2. ``run_agent.py::_recover_with_credential_pool`` — new branch for
   ``effective_reason == FailoverReason.overloaded``.  Same shape as
   the existing ``billing`` handler: rotate immediately via
   ``mark_exhausted_and_rotate(...)``, no retry-on-same-credential
   first.  The 503/529 ``overloaded`` classifications produced by the
   existing code now also flow through this branch.

``tests/agent/test_error_classifier.py`` — 7 cases:

- The exact NousResearch#15297 Z.AI message → ``overloaded``
- 9-phrase parametrised matrix (server/service/upstream overloaded,
  at/over capacity, "currently overloaded", …) → all ``overloaded``
- Plain "Rate limit reached for requests per minute" 429 → still
  ``rate_limit`` (regression guard against the disambiguation
  silently broadening)
- Plain "Too Many Requests" 429 → still ``rate_limit``
- 503 path unchanged → still ``overloaded`` (negative control)

``tests/agent/test_credential_pool_routing.py`` —
``TestPoolOverloadedRotation`` (4 cases):

- Overloaded on first failure → rotates immediately (the fix)
- Overloaded with ``has_retried_429=True`` → still rotates (no retry-
  flag dependence — the rate-limit gate is specific to ``rate_limit``)
- Single-entry pool overload → ``recovered=False`` so outer fallback
  takes over rather than spinning
- ``rate_limit`` on first failure → still uses retry-first path
  (regression guard against the new branch broadening)

**Verified regression guards**: temporarily reverted the classifier's
overload branch; 10 of the 11 new classifier tests correctly failed
with clear assertion messages.  Restored fix → all 145 tests pass
(``test_error_classifier.py`` + ``test_credential_pool_routing.py``,
existing + new).

Closes NousResearch#15297

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntial %s" (Copilot NousResearch#15414)

Copilot's review on NousResearch#15414 flagged that the rotation log lines read
``"Credential %s (billing) — rotated to pool entry %s"`` where the
first ``%s`` is bound to ``rotate_status`` (an HTTP status code, e.g.
429), not a credential ID.  Reading the literal log line —
``"Credential 429 (billing) — rotated to pool entry cred-2"`` — would
suggest "credential 429" was rotated, which is meaningless and
actively confusing during incident debugging.

Same issue existed on the three sibling branches (billing,
rate_limit, auth refresh-failed) plus the new overloaded branch this
PR adds.  Drive-by fix all four with the same ``HTTP %s (reason) —
rotated credential to pool entry %s`` shape so:

  HTTP 429 (provider overloaded) — rotated credential to pool entry cred-2

reads correctly as "we got HTTP 429, classified it as provider
overloaded, rotated to credential cred-2".

No behaviour change.  145/145 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans briandevans force-pushed the fix/error-classifier-429-overloaded branch from b0d5d63 to 0698e90 Compare April 29, 2026 14:11
@briandevans

Copy link
Copy Markdown
Contributor Author

Rebased onto current origin/main (810d98e89) — the previous base was 739 commits behind. One conflict in agent/error_classifier.py: main added _IMAGE_TOO_LARGE_PATTERNS immediately after _PAYLOAD_TOO_LARGE_PATTERNS, in the same band where this PR adds _OVERLOADED_PATTERNS. Resolved by keeping both pattern lists side-by-side — the additions are independent, no semantic overlap.

Re-ran the focused suite under pytest-xdist after the rebase:

  • tests/agent/test_error_classifier.py — 67 passed
  • tests/agent/test_credential_pool_routing.py — 78 passed
  • 145 total, no regressions

Force-pushed with --force-with-lease to briandevans:fix/error-classifier-429-overloaded. Mergeability should refresh from DIRTY to clean once GitHub recomputes.

@briandevans

Copy link
Copy Markdown
Contributor Author

Closing to keep the queue clean — happy to reopen if this is still useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error classifier: 429 'temporarily overloaded' misclassified as rate_limit — triggers wrong recovery path

3 participants