single-key auth errors should fail fast instead of retrying max_retries times #30331

@hi4659062-coder

Problem

When a user has only one API key configured (no credential pool with rotation), an auth error (HTTP 401/403) triggers the full retry loop instead of failing fast.

Current behavior

  1. API call fails with 401
  2. error_classifier.py classifies it as auth (retryable=True)
  3. _recover_with_credential_pool() has no backup keys to rotate to → returns False
  4. Falls back to generic retry logic: jittered_backoff(5s → 120s) × max_retries
  5. Each retry hits the same single expired/deprecated key → same 401
  6. Every retry is wasted (one API call plus its backoff delay each), and the loop eventually gives up
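
To put a rough number on step 4, here is a minimal sketch of the worst-case wait. It assumes the backoff doubles from 5s and is capped at 120s, with jitter ignored. The function name `worst_case_backoff_seconds`, the doubling factor, and the retry counts are assumptions for illustration; the real jittered_backoff may behave differently.

```python
def worst_case_backoff_seconds(max_retries: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Sum the un-jittered delays for `max_retries` retries.

    Assumption: the delay doubles from `base` and is capped at `cap`.
    The project's jittered_backoff may use another growth factor and adds jitter.
    """
    total = 0.0
    delay = base
    for _ in range(max_retries):
        total += min(delay, cap)  # never wait longer than the cap
        delay *= 2                # assumed exponential growth
    return total
```

Under these assumptions, five retries against a dead key add 155 seconds of waiting (5 + 10 + 20 + 40 + 80) before the error surfaces.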

Expected behavior

When there is only one credential in the pool (or no credential pool at all), an auth error should:

  • Try a single fresh connection (in case it was a transient auth hiccup)
  • If it fails again → immediately classify as auth_permanent (retryable=False)
  • Surface a clear error message: "Your API key appears to be invalid/expired. Try hermes config set provider.api_key ... or add a backup key via hermes auth add."

Why it matters

  • Wastes API calls on dead keys (up to max_retries invocations × backoff)
  • Delays actionable error feedback to the user
  • In gateway mode, the 5-120s backoff per retry makes the agent appear hung

Affected users

Anyone using a single API key who runs into an expired/revoked key. This is the most common deployment pattern (single .env key, no credential pool).

Proposed solution

In conversation_loop.py, after _recover_with_credential_pool() returns False:

# If credential pool had nothing to rotate to, treat auth as permanent
if not recovered_with_pool and classified.is_auth:
    if not transient_auth_retry_attempted:
        transient_auth_retry_attempted = True
        continue  # One retry with a fresh connection to rule out transient
    # Second failure → permanent
    classified = ClassifiedError(
        reason=FailoverReason.auth_permanent,
        retryable=False,
        message="API key appears invalid after retry — check your .env or run `hermes auth add`",
    )
    # Fall through to the retryable=False abort path below

This adds at most one extra API call (fresh connection), then immediately surfaces the error.
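
The control flow can be exercised in isolation with a small simulation. This is a sketch only: `FailoverReason`, `ClassifiedError`, `AuthFailure`, and `run_with_fail_fast` are hypothetical stand-ins written for this example, not the project's real classes, and the pool is hard-coded to "nothing to rotate to".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class FailoverReason(Enum):
    # Hypothetical stand-in for the project's enum.
    auth = "auth"
    auth_permanent = "auth_permanent"


@dataclass
class ClassifiedError:
    # Hypothetical stand-in mirroring the fields used in the proposal.
    reason: FailoverReason
    retryable: bool
    message: str

    @property
    def is_auth(self) -> bool:
        return self.reason in (FailoverReason.auth, FailoverReason.auth_permanent)


class AuthFailure(Exception):
    """Raised by the fake API call to mimic an HTTP 401."""


def run_with_fail_fast(call_api: Callable[[], str], max_retries: int = 5) -> str:
    transient_auth_retry_attempted = False
    for _ in range(max_retries):
        try:
            return call_api()
        except AuthFailure:
            classified = ClassifiedError(FailoverReason.auth, True, "401 from provider")
        recovered_with_pool = False  # single key: nothing to rotate to
        if not recovered_with_pool and classified.is_auth:
            if not transient_auth_retry_attempted:
                transient_auth_retry_attempted = True
                continue  # one retry to rule out a transient failure
            # Second auth failure: treat as permanent
            classified = ClassifiedError(
                reason=FailoverReason.auth_permanent,
                retryable=False,
                message="API key appears invalid after retry",
            )
        if not classified.retryable:
            raise RuntimeError(classified.message)
    raise RuntimeError("retries exhausted")
```

With a call that always raises `AuthFailure`, this loop makes exactly two attempts and then aborts, regardless of `max_retries` (as long as it is at least 2).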

Alternative considered

  • Check pool size upfront: Also valid, but requires knowing the pool cardinality at the error-site, which is one more coupling point.
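
For comparison, the upfront check could look roughly like the sketch below. The helper `has_rotation_target` and the representation of the pool as an optional sequence of keys are assumptions for illustration; the actual credential pool type is not shown in this issue.

```python
from typing import Optional, Sequence


def has_rotation_target(pool: Optional[Sequence[str]]) -> bool:
    """Return True only if the pool holds a second key to rotate to.

    Assumption: the pool is exposed as an optional sequence of keys.
    """
    return pool is not None and len(pool) > 1
```

The error site would have to receive the pool (or its size) to call this, which is the extra coupling point mentioned above.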

Reported by a user who hit this with an expired DeepSeek API key. The system correctly identified the 401 but burned multiple retries before giving up.

Metadata

    Labels

    P2: Medium, degraded but workaround exists
    area/auth: Authentication, OAuth, credential pools
    comp/agent: Core agent loop, run_agent.py, prompt builder
    type/bug: Something isn't working
