"overloaded" server errors classified as rate_limit, exhausting credential pool #14038

@seniaxls

Bug Description

When a provider (e.g., Z.AI) returns a "temporarily overloaded" error (HTTP 200 with code 1305), Hermes classifies it as rate_limit with should_rotate_credential=True. This causes the credential pool to mark the API key as "exhausted" after just 2 errors, making all further retries useless.

Steps to Reproduce

  1. Configure a provider that occasionally returns overloaded errors (e.g., Z.AI with a single API key)
  2. Trigger multiple requests during peak load
  3. Provider returns: HTTP 200: The service may be temporarily overloaded, please try again later
  4. After 2 errors, the single API key is marked exhausted
  5. All subsequent retries fail immediately with no valid credential
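
The exhaustion in steps 4 and 5 can be illustrated with a toy model. This is only a sketch: `ToyCredentialPool`, `simulate`, and `EXHAUST_AFTER` are hypothetical names, not Hermes code, and the threshold of 2 is taken from the behavior reported above.

```python
from dataclasses import dataclass, field

# Assumption from the report: a key is marked exhausted after 2 errors.
EXHAUST_AFTER = 2


@dataclass
class ToyCredentialPool:
    """Toy stand-in for a credential pool; not the Hermes implementation."""
    keys: list[str]
    failures: dict[str, int] = field(default_factory=dict)

    def available(self) -> list[str]:
        # A key is usable until it has accumulated EXHAUST_AFTER failures.
        return [k for k in self.keys if self.failures.get(k, 0) < EXHAUST_AFTER]

    def report_failure(self, key: str, should_rotate: bool) -> None:
        # Only errors that blame the credential count against it.
        if should_rotate:
            self.failures[key] = self.failures.get(key, 0) + 1


def simulate(should_rotate: bool, attempts: int = 5) -> int:
    """Return how many attempts actually reached the provider."""
    pool = ToyCredentialPool(keys=["single-key"])
    sent = 0
    for _ in range(attempts):
        usable = pool.available()
        if not usable:
            break  # no valid credential left: remaining retries fail immediately
        sent += 1
        # In this scenario every attempt gets an "overloaded" response.
        pool.report_failure(usable[0], should_rotate)
    return sent


print(simulate(should_rotate=True))   # 2
print(simulate(should_rotate=False))  # 5
```

With rotation enabled, the single key is gone after two overloaded responses and the remaining three retries never reach the provider. With rotation disabled, all five attempts are sent.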

Expected Behavior

"Overloaded" errors should be classified as server-side issues (FailoverReason.overloaded), NOT as rate limits. The credential is valid — the server is just busy. Rotating credentials is counterproductive and exhausts the pool unnecessarily.

Suggested Fix

In agent/error_classifier.py, add an overloaded check before the rate_limit check in both _classify_by_message functions:

# Overloaded patterns: server-side overload, NOT a credential/billing issue.
# Must come before the rate_limit check to avoid rotating credentials unnecessarily.
# "temporarily overloaded" is already covered by the "overloaded" substring match.
if "overloaded" in error_msg:
    return result_fn(
        FailoverReason.overloaded,
        retryable=True,
    )

# Rate limit patterns
if any(p in error_msg for p in _RATE_LIMIT_PATTERNS):
    ...

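Also add the overloaded patterns to the end of _RATE_LIMIT_PATTERNS. These entries depend on the ordering above: if the rate_limit check ran first, they would route overloaded errors straight back to rate_limit.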

    "servicequotaexceededexception",
    "overloaded",
    "temporarily overloaded",
]
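
Why the ordering matters can be checked with a self-contained sketch. Everything here is illustrative: `Reason` stands in for FailoverReason, `RATE_LIMIT_PATTERNS` is an abbreviated list rather than the real one, and the assumption that an overloaded result carries should_rotate_credential=False is mine, not something confirmed by the Hermes source.

```python
from enum import Enum


class Reason(Enum):
    """Hypothetical stand-in for FailoverReason."""
    overloaded = "overloaded"
    rate_limit = "rate_limit"
    unknown = "unknown"


# Abbreviated, illustrative list that includes the two proposed entries.
RATE_LIMIT_PATTERNS = [
    "rate limit",
    "too many requests",
    "overloaded",
    "temporarily overloaded",
]


def classify(error_msg: str, overloaded_first: bool = True) -> tuple[Reason, bool]:
    """Return (reason, should_rotate_credential) for an error message."""
    msg = error_msg.lower()
    is_overloaded = "overloaded" in msg
    is_rate_limited = any(p in msg for p in RATE_LIMIT_PATTERNS)

    if overloaded_first and is_overloaded:
        return Reason.overloaded, False  # server is busy: retry, keep the key
    if is_rate_limited:
        return Reason.rate_limit, True   # credential-level throttling: rotate
    if is_overloaded:
        return Reason.overloaded, False
    return Reason.unknown, False


msg = "The service may be temporarily overloaded, please try again later"
print(classify(msg))                          # overloaded check first
print(classify(msg, overloaded_first=False))  # rate_limit check first
```

With the overloaded check first, the Z.AI message yields `Reason.overloaded` with no rotation. With the order reversed, the same message matches the rate-limit list and requests a credential rotation, which is the behavior this issue reports.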

Related

Retry parameters (max_retries, base_delay, max_delay) are hardcoded in run_agent.py. Making them configurable via config.yaml would help users tune retry behavior for providers with volatile availability without editing source code (changes are lost on hermes update).
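
A rough sketch of what configurable retry parameters could look like. The `retry` section name, the `RetryConfig` class, and the default values are all hypothetical (they are not the values hardcoded in run_agent.py), and the input dict stands in for whatever a YAML loader would return for config.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    # Placeholder defaults, NOT the values hardcoded in run_agent.py.
    max_retries: int = 3
    base_delay: float = 1.0
    max_delay: float = 30.0

    @classmethod
    def from_mapping(cls, cfg: dict) -> "RetryConfig":
        """Build from a parsed config mapping, falling back to defaults."""
        retry = cfg.get("retry") or {}  # hypothetical section name
        defaults = cls()
        return cls(
            max_retries=int(retry.get("max_retries", defaults.max_retries)),
            base_delay=float(retry.get("base_delay", defaults.base_delay)),
            max_delay=float(retry.get("max_delay", defaults.max_delay)),
        )

    def delay_for(self, attempt: int) -> float:
        """Exponential backoff for a 0-based attempt, capped at max_delay."""
        return min(self.base_delay * (2 ** attempt), self.max_delay)


# Stand-in for the dict a YAML loader would produce from config.yaml.
parsed = {"retry": {"max_retries": 6, "base_delay": 2.0, "max_delay": 20.0}}
cfg = RetryConfig.from_mapping(parsed)
print([cfg.delay_for(i) for i in range(cfg.max_retries)])
# [2.0, 4.0, 8.0, 16.0, 20.0, 20.0]
```

A user on a provider with volatile availability could then raise max_retries and max_delay in their own config file, and the change would survive hermes update.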

Environment

  • Provider: Z.AI (api.z.ai) GLM Coding Max Plan
  • Error: HTTP 200 with code 1305 "The service may be temporarily overloaded, please try again later"

Metadata

Labels

  • P1: High (major feature broken, no workaround)
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • provider/zai: ZAI provider
  • type/bug: Something isn't working
