fix(agent): fail fast on single-key auth errors instead of burning max_retries by Carry00 · Pull Request #30794 · NousResearch/hermes-agent

Carry00 · 2026-05-23T07:12:07Z

The bug

When a user has one credential in ~/.hermes/.env (no pool to rotate
to) and that key returns HTTP 401/403, the conversation loop classifies
the error as auth (retryable=True) and hits the generic backoff path:
jittered_backoff(retry_count, base_delay=5.0, max_delay=120.0) ×
max_retries. Each retry calls the same dead key and gets the same 401.

For the typical max_retries=5 configuration that's 5 + 10 + 20 + 40 + 80 = 155 s of nominal backoff (before jitter) spent on a credential that is not going to start working on its own, and the user sees the agent appear hung the entire time. In gateway mode this looks identical to a wedged process.
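
To make the arithmetic concrete, here is a minimal sketch of the nominal schedule. It assumes the delay doubles from base_delay on each retry and is capped at max_delay, with jitter ignored. nominal_backoff is a hypothetical stand-in for illustration, not the project's jittered_backoff.

```python
def nominal_backoff(retry_count: int, base_delay: float = 5.0,
                    max_delay: float = 120.0) -> float:
    """Hypothetical jitter-free delay for the retry_count-th retry (1-based)."""
    return min(base_delay * 2 ** (retry_count - 1), max_delay)


max_retries = 5
delays = [nominal_backoff(n) for n in range(1, max_retries + 1)]
total = sum(delays)  # seconds spent sleeping before the loop gives up
print(delays, total)
```

Under these assumptions a sixth retry would already hit the 120 s cap, so larger max_retries values grow the wait by two minutes per extra attempt.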

The fix

After all provider-specific OAuth refresh paths have had their chance
(codex / nous / copilot / anthropic — each of these already gets one
shot at minting a fresh token), an auth error with recovered_with_pool == False now:

1. First occurrence: retries once with a fresh connection so a
   genuine transient hiccup (load balancer flap, TLS reset, brief
   provider-side blip) still clears. A single visible line is logged so
   the user knows what's happening.
2. Second occurrence: upgrades the ClassifiedError reason from
   auth to auth_permanent. The downstream is_client_error branch
   sees retryable=False and takes the existing non-retryable abort
   path, which already prints the "your API key was rejected" actionable
   hints for Codex/xAI OAuth, OpenRouter, and generic providers.

status_code / provider / model / error_context are preserved
through the upgrade so the abort path's diagnostic output stays accurate
(provider, endpoint, masked token prefix, etc.).
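
The two-stage behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: the ClassifiedError and TurnState shapes and the handle_single_key_auth helper are hypothetical, and only the field names, the auth / auth_permanent reasons, and the single_key_auth_retry_attempted flag come from the description.

```python
from dataclasses import dataclass, field, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClassifiedError:
    # Hypothetical stand-in; the real class may carry more fields.
    reason: str
    retryable: bool
    message: str = ""
    status_code: Optional[int] = None
    provider: Optional[str] = None
    model: Optional[str] = None
    error_context: dict = field(default_factory=dict)


@dataclass
class TurnState:
    # One-shot flag, reset for every turn.
    single_key_auth_retry_attempted: bool = False


def handle_single_key_auth(err: ClassifiedError, state: TurnState,
                           recovered_with_pool: bool) -> Tuple[str, ClassifiedError]:
    """Return (action, error): 'retry_fresh' once, then 'abort'."""
    if err.reason != "auth" or recovered_with_pool:
        return "generic", err  # not our case; leave to the existing paths
    if not state.single_key_auth_retry_attempted:
        state.single_key_auth_retry_attempted = True
        return "retry_fresh", err  # a single visible line would be logged here
    # Second occurrence: change only reason/retryable (and fill an empty
    # message); every other diagnostic field is copied through unchanged.
    upgraded = replace(
        err,
        reason="auth_permanent",
        retryable=False,
        message=err.message or "credential appears invalid or revoked",
    )
    return "abort", upgraded
```

Using dataclasses.replace in the sketch keeps status_code, provider, model, and error_context intact without listing them, which is one way to guarantee the abort path sees the original diagnostics.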

Why this placement

Placement after the provider-specific 401 handlers, before the generic
retry, is load-bearing:

- It lets _try_refresh_codex_client_credentials / _try_refresh_nous_*
  / _try_refresh_copilot_* / _try_refresh_anthropic_* run first.
  Those refresh real OAuth tokens; they're the actual recovery path
  when the failure was a stale token, not a revoked key.
- The fail-fast block only fires when those refreshers either don't
  apply (provider has no refresher: openrouter, openai, deepseek,
  generic OpenAI-compatible) or have already run without fixing the
  underlying 401.
- It does not fire on rate-limit, billing, server-error, or
  context-overflow paths because those have is_auth = False.
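
The ordering argument can be illustrated with a simplified dispatcher. Everything here is an assumption made for the sketch: the next_step function, the refreshers registry, and the returned labels are hypothetical, and the real conversation loop is more involved.

```python
from typing import Callable, Dict

# Hypothetical registry: provider name -> refresher that returns True
# when it managed to mint a fresh OAuth token.
Refresher = Callable[[], bool]


def next_step(provider: str, is_auth: bool, recovered_with_pool: bool,
              refreshers: Dict[str, Refresher]) -> str:
    """Decide what the loop does next for a classified error."""
    if not is_auth:
        # Rate-limit, billing, server-error, context-overflow: untouched.
        return "generic_retry"
    refresher = refreshers.get(provider)
    if refresher is not None and refresher():
        return "retry_with_fresh_token"  # stale token, now recovered
    if recovered_with_pool:
        return "retry_with_rotated_key"  # pool had room to rotate
    # No refresher applied (or it did not help) and nothing to rotate to.
    return "fail_fast"
```

For example, next_step("openrouter", True, False, {}) reaches the fail-fast arm, while a provider whose refresher succeeds never gets that far.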

Tests

tests/agent/test_single_key_auth_fail_fast.py — 9 tests covering:

- is_auth still recognises the upgraded auth_permanent reason (so
  the abort path's actionable-hint block still fires).
- The conversation-loop source declares the one-shot flag and the
  fail-fast block, gated on classified.is_auth and not recovered_with_pool, so a future refactor cannot silently drop the
  fix back to the old max_retries-burning behaviour.
- The fresh-retry arm appears textually before the permanent-upgrade
  arm.
- The upgrade preserves status_code / provider / model /
  error_context for the downstream diagnostic output.
- The upgrade synthesises a "credential appears invalid or revoked"
  message when the original error message was empty.
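
A source-order guard of the kind described can be sketched as below. The appears_before helper and the SAMPLE excerpt are hypothetical; a real test would more likely read the loop's source (for instance with inspect.getsource) and search for whatever marker strings the implementation actually contains.

```python
def appears_before(source: str, first: str, second: str) -> bool:
    """True when both markers occur and `first` starts earlier than `second`."""
    i, j = source.find(first), source.find(second)
    return i != -1 and j != -1 and i < j


# Hypothetical excerpt standing in for the conversation-loop source.
SAMPLE = '''
if classified.is_auth and not recovered_with_pool:
    if not single_key_auth_retry_attempted:
        single_key_auth_retry_attempted = True
        continue
    reason = "auth_permanent"
'''

# The fresh-retry arm must come textually before the permanent upgrade.
assert appears_before(
    SAMPLE, "single_key_auth_retry_attempted = True", '"auth_permanent"'
)
```

Textual checks like this are brittle against renames, but they fail loudly if a refactor deletes or reorders the block, which is the stated goal.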

Verified the surrounding test suites are unaffected:

$ venv/bin/python -m pytest \
    tests/agent/test_single_key_auth_fail_fast.py \
    tests/agent/test_credential_pool_routing.py \
    tests/agent/test_credential_pool.py \
    tests/agent/test_gemini_fast_fallback.py \
    tests/agent/test_unsupported_parameter_retry.py \
    tests/agent/test_unsupported_temperature_retry.py
======================== 113 passed in 5.34s ========================

Test plan

- On a host with an expired DeepSeek (or any OpenAI-compatible)
  key in .env, run hermes -z "ping" and confirm:
  - One visible "Auth failure with no credential to rotate to — retrying
    once" line.
  - Total wall-clock before the abort is < 10 s (fresh retry + abort),
    not 2–8 minutes.
  - The existing actionable hint ("Check API key", "Run hermes setup",
    OpenRouter credits link, etc.) still prints.
- With a valid key and a transient network hiccup, the agent
  still succeeds on the fresh-connection retry, i.e. we don't
  regress on the genuinely transient case.
- With a credential pool that does have rotation room, behaviour
  is unchanged: the fail-fast block is gated on
  not recovered_with_pool and stays out of the way.
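
The wall-clock check in the first item could be scripted along these lines. This is only a sketch: timed_run is a hypothetical helper, and it makes no assumption about the exit code or exact output of the hermes command.

```python
import subprocess
import sys
import time


def timed_run(cmd, budget_s):
    """Run cmd; return (elapsed seconds, finished within budget, combined output)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return elapsed, elapsed < budget_s, proc.stdout + proc.stderr


# Example against the interpreter itself, so the sketch runs anywhere.
elapsed, ok, output = timed_run([sys.executable, "-c", "print('ping')"], 30.0)
print(f"{elapsed:.2f}s within budget: {ok}")

# On a host with an expired key, the manual check would look like:
#   elapsed, ok, output = timed_run(["hermes", "-z", "ping"], budget_s=10.0)
# followed by inspecting `output` for the retry line and the actionable hint.
```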

🤖 Generated with Claude Code

…x_retries

Closes NousResearch#30331.

When the configured credential pool has nothing to rotate to (single-key deployments, the common pattern of one provider key in ~/.hermes/.env) and an HTTP 401/403 comes back, the existing retry loop would treat the error as transient and hit `jittered_backoff(retry_count, base_delay=5.0, max_delay=120.0)` × `max_retries` times. Each retry hit the same dead key and got the same 401: up to ~8 minutes of pure latency before the user saw an actionable error.

The fix adds a one-shot `single_key_auth_retry_attempted` flag to the per-turn state block. After all provider-specific OAuth refresh paths have had their chance (codex / nous / copilot / anthropic), an auth error with no pool rotation available now:

1. On first occurrence: retries once with a fresh connection (handles genuine transient hiccups), logging a single visible line so the user can see what's happening.
2. On the second occurrence: upgrades the ClassifiedError reason from `auth` to `auth_permanent`. The downstream `is_client_error` branch sees `retryable=False` and takes the existing non-retryable abort path, which already has actionable "your API key was rejected" hints for Codex/xAI OAuth, OpenRouter, and generic providers.

status_code / provider / model / error_context are preserved through the upgrade so the abort path's diagnostic output stays accurate.

Carry00 · 2026-05-27T20:14:25Z

@teknium1 Hi! Just a gentle ping on this one too — no rush at all, just wanted to make sure it wasn't buried. Happy to revise or rebase if needed. Thanks so much for maintaining this project!

alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), and area/auth (Authentication, OAuth, credential pools) labels May 23, 2026

Carry00 force-pushed the fix/30331-single-key-auth-fail-fast branch from 8b96413 to cc6fd89, May 23, 2026 09:40

Carry00 force-pushed the fix/30331-single-key-auth-fail-fast branch from cc6fd89 to 1bb4b36, May 25, 2026 19:58

Carry00 wants to merge 1 commit into NousResearch:main from Carry00:fix/30331-single-key-auth-fail-fast
