Error classifier: 429 'temporarily overloaded' misclassified as rate_limit — triggers wrong recovery path #15297

@mordekai-lab

Summary

The error classifier treats all HTTP 429 responses as FailoverReason.rate_limit, regardless of whether the 429 indicates a per-key rate limit or a server-side overload. This causes the wrong recovery strategy to be used.

Some providers (e.g. Z.AI/Zhipu) return HTTP 429 with messages like:

HTTP 429: The service may be temporarily overloaded, please try again later

This is a server-side overload — the entire provider endpoint is struggling, not just this API key hitting a per-key quota. The recovery strategy should be different:

| Reason     | Correct Behavior                                            |
|------------|-------------------------------------------------------------|
| rate_limit | Retry same credential once, then rotate to next key         |
| overloaded | Skip retry, rotate immediately (the whole provider is down) |
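
The two strategies listed above could be expressed as a small lookup, sketched below. This is only an illustration: `RecoveryPlan`, `RECOVERY_PLANS`, and `plan_for` are hypothetical names, and the `FailoverReason` enum here is a stand-in containing just the two members discussed, not the project's actual definition.

```python
from dataclasses import dataclass
from enum import Enum


class FailoverReason(Enum):
    # Stand-in for the project's enum; only the two members discussed here.
    rate_limit = "rate_limit"
    overloaded = "overloaded"


@dataclass(frozen=True)
class RecoveryPlan:
    # Hypothetical type for illustration, not part of hermes-agent.
    same_credential_retries: int  # retries on the current key before rotating
    rotate_credential: bool       # whether to move to the next key afterwards


# One retry on the same key for rate limits, none for overload.
RECOVERY_PLANS = {
    FailoverReason.rate_limit: RecoveryPlan(same_credential_retries=1, rotate_credential=True),
    FailoverReason.overloaded: RecoveryPlan(same_credential_retries=0, rotate_credential=True),
}


def plan_for(reason: FailoverReason) -> RecoveryPlan:
    """Look up the recovery plan for a failover reason."""
    return RECOVERY_PLANS[reason]
```

Keeping the policy in data rather than in branches makes the difference between the two reasons explicit: only the retry count changes.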

Current Behavior

agent/error_classifier.py (line ~551):

if status_code == 429:
    return result_fn(
        FailoverReason.rate_limit,  # ← always rate_limit
        retryable=True,
        should_rotate_credential=True,
        should_fallback=True,
    )

The message body is not inspected to distinguish overload from rate limiting.

Additionally, FailoverReason.overloaded exists as an enum value but is never produced by the 429 classification path. _handle_credential_failover() in run_agent.py also has no handler for it, so an overloaded reason would fall through to the default no-op return.

Proposed Fix

  1. In error_classifier.py: inspect the error message for overload patterns ("temporarily overloaded", "server is overloaded", "capacity", etc.) and classify as FailoverReason.overloaded instead of rate_limit
  2. In run_agent.py: add an overloaded handler in _handle_credential_failover() that skips the retry-on-same-credential step and rotates immediately (same behavior as billing)
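
A minimal sketch of step 1, assuming the error message is available as a string at classification time. The helper name `classify_429` and the `_OVERLOAD_PATTERN` constant are hypothetical, and the `FailoverReason` enum is a stand-in rather than the project's definition. The patterns are the ones quoted in the proposal; "capacity" is broad and may need narrowing to avoid false positives.

```python
import re
from enum import Enum


class FailoverReason(Enum):
    # Stand-in for the project's enum; only the two members discussed here.
    rate_limit = "rate_limit"
    overloaded = "overloaded"


# Overload phrases from the proposal, matched case-insensitively.
_OVERLOAD_PATTERN = re.compile(
    r"temporarily overloaded|server is overloaded|capacity",
    re.IGNORECASE,
)


def classify_429(message: str) -> FailoverReason:
    """Return overloaded when a 429 message looks like a server-side overload."""
    if _OVERLOAD_PATTERN.search(message or ""):
        return FailoverReason.overloaded
    # Anything else keeps the existing behavior.
    return FailoverReason.rate_limit
```

With this sketch, the Z.AI message "HTTP 429: The service may be temporarily overloaded, please try again later" would classify as overloaded, while a message with no overload phrase would still classify as rate_limit.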

Environment

  • Hermes-agent latest
  • Observed with Z.AI provider returning HTTP 429: "The service may be temporarily overloaded, please try again later"
  • No retry_after header or resets_at field in the error response

Metadata

Labels

  • P2: Medium, degraded but workaround exists
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • provider/zai: ZAI provider
  • type/bug: Something isn't working
