fix(agent): fail fast on single-key auth errors instead of burning max_retries #30794

Open
Carry00 wants to merge 1 commit into NousResearch:main from Carry00:fix/30331-single-key-auth-fail-fast
Carry00 commented May 23, 2026

Closes #30331.

The bug

When a user has one credential in ~/.hermes/.env (no pool to rotate
to) and that key returns HTTP 401/403, the conversation loop classifies
the error as auth (retryable=True) and hits the generic backoff path:
jittered_backoff(retry_count, base_delay=5.0, max_delay=120.0) ×
max_retries. Each retry calls the same dead key and gets the same 401.

For the typical max_retries=5 configuration that's nominally
5 + 10 + 20 + 40 + 80 = 155 s (before jitter) of pure latency on a
credential that is not going to start working on its own, and the user
sees the agent appear hung the entire time. In gateway mode this looks
identical to a wedged process.
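
The latency figure can be reproduced with a small sketch. It assumes the delay doubles from base_delay on each retry and is capped at max_delay; `nominal_backoff` and `total_wait` are hypothetical helpers that ignore jitter, not the project's `jittered_backoff`.

```python
def nominal_backoff(retry_count: int, base_delay: float = 5.0,
                    max_delay: float = 120.0) -> float:
    """Hypothetical stand-in: exponential delay without jitter.

    retry_count starts at 1 for the first retry.
    """
    return min(base_delay * (2 ** (retry_count - 1)), max_delay)


def total_wait(max_retries: int) -> float:
    """Sum of the nominal delays across all retries."""
    return sum(nominal_backoff(n) for n in range(1, max_retries + 1))


delays = [nominal_backoff(n) for n in range(1, 6)]
print(delays)         # [5.0, 10.0, 20.0, 40.0, 80.0]
print(total_wait(5))  # 155.0
```

Under the same assumption, any retry past the fifth would wait the capped 120 s, so larger max_retries values grow the total quickly.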

The fix

After all provider-specific OAuth refresh paths have had their chance
(codex / nous / copilot / anthropic — each of these already gets one
shot at minting a fresh token), an auth error with recovered_with_pool == False now:

  1. First occurrence — retries once with a fresh connection so a
    genuine transient hiccup (load balancer flap, TLS reset, brief
    provider-side blip) still clears. A single visible line is logged so
    the user knows what's happening.
  2. Second occurrence — upgrades the ClassifiedError reason from
    auth to auth_permanent. The downstream is_client_error branch
    sees retryable=False and takes the existing non-retryable abort
    path, which already prints the "your API key was rejected" actionable
    hints for Codex/xAI OAuth, OpenRouter, and generic providers.

status_code / provider / model / error_context are preserved
through the upgrade so the abort path's diagnostic output stays accurate
(provider, endpoint, masked token prefix, etc.).
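
As a rough illustration of the two steps and the field-preserving upgrade, here is a minimal sketch. The `ClassifiedError` dataclass below is a simplified, hypothetical stand-in for the real class, and `handle_single_key_auth` plus the dict-based state are assumptions made for the example; only the flag name, the reason strings, and the fallback message come from this PR.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class ClassifiedError:
    """Hypothetical, simplified stand-in for the project's ClassifiedError."""
    reason: str
    retryable: bool
    status_code: Optional[int] = None
    provider: str = ""
    model: str = ""
    message: str = ""
    error_context: dict = field(default_factory=dict)


def handle_single_key_auth(err: ClassifiedError, state: dict):
    """Return ("retry", err) on the first auth failure, ("abort", upgraded) on the second."""
    if not state.get("single_key_auth_retry_attempted"):
        # First occurrence: mark the one-shot flag and retry on a fresh connection.
        state["single_key_auth_retry_attempted"] = True
        return "retry", err
    # Second occurrence: change only reason/retryable (and fill an empty message);
    # status_code, provider, model and error_context are carried over unchanged.
    upgraded = replace(
        err,
        reason="auth_permanent",
        retryable=False,
        message=err.message or "credential appears invalid or revoked",
    )
    return "abort", upgraded


state = {}
err = ClassifiedError("auth", True, status_code=401,
                      provider="deepseek", model="example-model")
first = handle_single_key_auth(err, state)
second = handle_single_key_auth(err, state)
print(first[0], second[0], second[1].reason)  # retry abort auth_permanent
```

Using `dataclasses.replace` in the sketch keeps the diagnostic fields intact by construction, since only the named fields change.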

Why this placement

Placement after the provider-specific 401 handlers, before the generic
retry, is load-bearing:

  • It lets _try_refresh_codex_client_credentials / _try_refresh_nous_*
    / _try_refresh_copilot_* / _try_refresh_anthropic_* run first.
    Those refresh real OAuth tokens — they're the actual recovery path
    when the failure was a stale token, not a revoked key.
  • The fail-fast block only fires when those refreshers either don't
    apply (provider has no refresher — openrouter, openai, deepseek,
    generic OpenAI-compatible) or have already run without fixing the
    underlying 401.
  • It does not fire on rate-limit, billing, server-error, or
    context-overflow paths because those have is_auth = False.
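
The ordering argument can be sketched as a small decision function. This is an illustration only: `next_step`, its return labels, and the idea of passing a refresher as a callable are hypothetical, and the relative order of the refresher and pool checks is an assumption rather than something this PR states.

```python
from typing import Callable, Optional

AUTH_REASONS = {"auth", "auth_permanent"}


def next_step(reason: str, recovered_with_pool: bool,
              refresher: Optional[Callable[[], bool]]) -> str:
    """Pick the next step for a classified error (hypothetical labels)."""
    if reason not in AUTH_REASONS:
        # Rate-limit, billing, server-error, context-overflow: not an auth error,
        # so the fail-fast block never sees it.
        return "other_handling"
    if refresher is not None and refresher():
        # A provider-specific OAuth refresher minted a fresh token.
        return "retry_with_new_token"
    if recovered_with_pool:
        # The credential pool rotated to another key.
        return "retry_with_rotated_key"
    # No refresher applied (or it did not help) and nothing to rotate to.
    return "single_key_fail_fast"


print(next_step("rate_limit", False, None))    # other_handling
print(next_step("auth", False, lambda: True))  # retry_with_new_token
print(next_step("auth", False, None))          # single_key_fail_fast
```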

Tests

tests/agent/test_single_key_auth_fail_fast.py — 9 tests covering:

  • is_auth still recognises the upgraded auth_permanent reason (so
    the abort path's actionable-hint block still fires).
  • The conversation-loop source declares the one-shot flag and the
    fail-fast block, gated on classified.is_auth and not recovered_with_pool — so a future refactor cannot silently drop the
    fix back to the old max_retries-burning behaviour.
  • The fresh-retry arm appears textually before the permanent-upgrade
    arm.
  • The upgrade preserves status_code / provider / model /
    error_context for the downstream diagnostic output.
  • The upgrade synthesises a "credential appears invalid or revoked"
    message when the original error message was empty.
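
For readers unfamiliar with source-inspection tests, the "appears textually before" check can be sketched roughly as follows. `appears_before` and `SAMPLE_SOURCE` are hypothetical; the sample is not the real conversation-loop code, and the actual tests may locate the two arms differently.

```python
def appears_before(source: str, first_marker: str, second_marker: str) -> bool:
    """True if both markers occur and first_marker starts earlier in source."""
    first = source.find(first_marker)
    second = source.find(second_marker)
    return first != -1 and second != -1 and first < second


# Hypothetical excerpt standing in for the conversation-loop source.
SAMPLE_SOURCE = '''
if classified.is_auth and not recovered_with_pool:
    if not single_key_auth_retry_attempted:
        single_key_auth_retry_attempted = True
        # fresh-connection retry
    else:
        reason = "auth_permanent"
'''

fresh_first = appears_before(
    SAMPLE_SOURCE,
    "single_key_auth_retry_attempted = True",
    '"auth_permanent"',
)
assert fresh_first
```

A check like this is brittle against renames, but it is cheap and fails loudly if a refactor reorders or removes either arm.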

Verified the surrounding test suites are unaffected:

$ venv/bin/python -m pytest \
    tests/agent/test_single_key_auth_fail_fast.py \
    tests/agent/test_credential_pool_routing.py \
    tests/agent/test_credential_pool.py \
    tests/agent/test_gemini_fast_fallback.py \
    tests/agent/test_unsupported_parameter_retry.py \
    tests/agent/test_unsupported_temperature_retry.py
======================== 113 passed in 5.34s ========================

Test plan

  • On a host with an expired DeepSeek (or any OpenAI-compatible)
    key in .env, run hermes -z "ping" and confirm:
    • One visible "Auth failure with no credential to rotate to — retrying
      once" line.
    • Total wall-clock before the abort is < 10 s (fresh retry + abort),
      not 2–8 minutes.
    • The existing actionable hint ("Check API key", "Run hermes setup",
      OpenRouter credits link, etc.) still prints.
  • With a valid key and a transient network hiccup, the agent
    still succeeds on the fresh-connection retry — i.e. we don't
    regress on the genuinely transient case.
  • With a credential pool that does have rotation room, behaviour
    is unchanged — the fail-fast block is gated on
    not recovered_with_pool and stays out of the way.

🤖 Generated with Claude Code

alt-glitch added labels on May 23, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), area/auth (Authentication, OAuth, credential pools)
Carry00 force-pushed the fix/30331-single-key-auth-fail-fast branch from 8b96413 to cc6fd89 on May 23, 2026 09:40
…x_retries

Closes NousResearch#30331.

When the configured credential pool has nothing to rotate to (single-key
deployments — the common pattern of one provider key in ~/.hermes/.env)
and an HTTP 401/403 comes back, the existing retry loop would treat the
error as transient and hit `jittered_backoff(retry_count, base_delay=5.0,
max_delay=120.0)` × `max_retries` times. Each retry hit the same dead
key and got the same 401 — up to ~8 minutes of pure latency before the
user saw an actionable error.

The fix adds a one-shot `single_key_auth_retry_attempted` flag to the
per-turn state block. After all provider-specific OAuth refresh paths
have had their chance (codex / nous / copilot / anthropic), an auth
error with no pool rotation available now:

1. On first occurrence — retries once with a fresh connection (handles
   genuine transient hiccups), logging a single visible line so the
   user can see what's happening.
2. On the second occurrence — upgrades the ClassifiedError reason from
   `auth` to `auth_permanent`. The downstream `is_client_error` branch
   sees `retryable=False` and takes the existing non-retryable abort
   path, which already has actionable "your API key was rejected" hints
   for Codex/xAI OAuth, OpenRouter, and generic providers.

status_code / provider / model / error_context are preserved through
the upgrade so the abort path's diagnostic output stays accurate.
Carry00 force-pushed the fix/30331-single-key-auth-fail-fast branch from cc6fd89 to 1bb4b36 on May 25, 2026 19:58

Carry00 commented May 27, 2026

@teknium1 Hi! Just a gentle ping on this one too — no rush at all, just wanted to make sure it wasn't buried. Happy to revise or rebase if needed. Thanks so much for maintaining this project!


Development

Successfully merging this pull request may close these issues.

single-key auth errors should fail fast instead of retrying max_retries times

2 participants