fix(agent): fail fast on single-key auth errors instead of burning max_retries #30794
Open
Carry00 wants to merge 1 commit into
…x_retries

Closes NousResearch#30331.

When the configured credential pool has nothing to rotate to (single-key deployments, the common pattern of one provider key in ~/.hermes/.env) and an HTTP 401/403 comes back, the existing retry loop would treat the error as transient and hit `jittered_backoff(retry_count, base_delay=5.0, max_delay=120.0)` × `max_retries` times. Each retry hit the same dead key and got the same 401: up to ~8 minutes of pure latency before the user saw an actionable error.

The fix adds a one-shot `single_key_auth_retry_attempted` flag to the per-turn state block. After all provider-specific OAuth refresh paths have had their chance (codex / nous / copilot / anthropic), an auth error with no pool rotation available now:

1. On first occurrence: retries once with a fresh connection (handles genuine transient hiccups), logging a single visible line so the user can see what's happening.
2. On the second occurrence: upgrades the ClassifiedError reason from `auth` to `auth_permanent`. The downstream `is_client_error` branch sees `retryable=False` and takes the existing non-retryable abort path, which already has actionable "your API key was rejected" hints for Codex/xAI OAuth, OpenRouter, and generic providers.

status_code / provider / model / error_context are preserved through the upgrade so the abort path's diagnostic output stays accurate.
@teknium1 Hi! Just a gentle ping on this one too. No rush at all, just wanted to make sure it wasn't buried. Happy to revise or rebase if needed. Thanks so much for maintaining this project!
Closes #30331.
The bug
When a user has one credential in `~/.hermes/.env` (no pool to rotate to) and that key returns HTTP 401/403, the conversation loop classifies the error as `auth` (`retryable=True`) and hits the generic backoff path: `jittered_backoff(retry_count, base_delay=5.0, max_delay=120.0)` × `max_retries`. Each retry calls the same dead key and gets the same 401.

For the typical `max_retries=5` configuration that's `5 + 10 + 20 + 40 + 80 ≈ 155 s` of pure latency on a credential that is not going to start working on its own, and the user sees the agent appear hung the entire time. In gateway mode this looks identical to a wedged process.
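The latency figure can be checked with a small sketch. `nominal_backoff` and `total_wait` are hypothetical stand-ins, not the project's `jittered_backoff`: the sketch assumes the delay doubles from `base_delay` starting at retry 1, is capped at `max_delay`, and ignores the jitter term, which is the schedule the `5 + 10 + 20 + 40 + 80` sum implies.

```python
def nominal_backoff(retry_count: int, base_delay: float = 5.0, max_delay: float = 120.0) -> float:
    """Capped exponential delay without jitter (hypothetical stand-in, not the real helper)."""
    # Assumption: retry_count starts at 1 and the delay doubles on each retry.
    return min(base_delay * 2 ** (retry_count - 1), max_delay)


def total_wait(max_retries: int) -> float:
    """Sum the nominal delays for retries 1..max_retries."""
    return sum(nominal_backoff(n) for n in range(1, max_retries + 1))


delays = [nominal_backoff(n) for n in range(1, 6)]
print(delays)         # [5.0, 10.0, 20.0, 40.0, 80.0]
print(total_wait(5))  # 155.0
```

Under the same assumptions, any retry past the fifth would wait the full `max_delay` of 120 s, so larger `max_retries` values add two minutes per extra attempt.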
The fix
After all provider-specific OAuth refresh paths have had their chance (codex / nous / copilot / anthropic; each of these already gets one shot at minting a fresh token), an auth error with `recovered_with_pool == False` now:

1. On first occurrence: retries once with a fresh connection, so a genuine transient hiccup (load balancer flap, TLS reset, brief provider-side blip) still clears. A single visible line is logged so the user knows what's happening.
2. On the second occurrence: upgrades the `ClassifiedError` reason from `auth` to `auth_permanent`. The downstream `is_client_error` branch sees `retryable=False` and takes the existing non-retryable abort path, which already prints the "your API key was rejected" actionable hints for Codex/xAI OAuth, OpenRouter, and generic providers.

`status_code` / `provider` / `model` / `error_context` are preserved through the upgrade so the abort path's diagnostic output stays accurate (provider, endpoint, masked token prefix, etc.).
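The two-step behaviour can be sketched as a one-shot flag plus a field-preserving upgrade. This is an illustrative sketch, not the code in the PR: the `ClassifiedError` and `TurnState` shapes, the `apply_single_key_fail_fast` helper, and the log wording are all assumptions; only the flag name, the reason strings, and the preserved field names come from the description.

```python
from dataclasses import dataclass, field, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClassifiedError:
    # Hypothetical stand-in; field names follow the PR description.
    reason: str
    retryable: bool
    status_code: Optional[int] = None
    provider: Optional[str] = None
    model: Optional[str] = None
    error_context: dict = field(default_factory=dict)

    @property
    def is_auth(self) -> bool:
        # Both reasons count as auth, so the abort path's hints still apply.
        return self.reason in ("auth", "auth_permanent")


@dataclass
class TurnState:
    # One-shot flag, reset for every turn (assumed to live in per-turn state).
    single_key_auth_retry_attempted: bool = False


def apply_single_key_fail_fast(
    classified: ClassifiedError, state: TurnState, recovered_with_pool: bool
) -> Tuple[ClassifiedError, bool]:
    """Return (possibly upgraded error, whether to retry immediately)."""
    if not classified.is_auth or recovered_with_pool:
        return classified, False  # not a single-key auth failure; other paths handle it
    if not state.single_key_auth_retry_attempted:
        state.single_key_auth_retry_attempted = True
        # Hypothetical log wording; the real message is not quoted in the PR.
        print("auth error with no credential to rotate to; retrying once")
        return classified, True
    # Second occurrence: only reason and retryable change; every other field is kept.
    return replace(classified, reason="auth_permanent", retryable=False), False


state = TurnState()
err = ClassifiedError("auth", True, status_code=401, provider="openrouter")
first, retry_first = apply_single_key_fail_fast(err, state, recovered_with_pool=False)
second, retry_second = apply_single_key_fail_fast(err, state, recovered_with_pool=False)
print(retry_first, first.reason)    # True auth
print(retry_second, second.reason)  # False auth_permanent
```

Using `dataclasses.replace` here is one way to guarantee that diagnostic fields survive the upgrade without listing them by hand; the actual change may copy them explicitly.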
Why this placement
Placement after the provider-specific 401 handlers, before the generic
retry, is load-bearing:
- `_try_refresh_codex_client_credentials` / `_try_refresh_nous_*` / `_try_refresh_copilot_*` / `_try_refresh_anthropic_*` run first. Those refresh real OAuth tokens; they're the actual recovery path when the failure was a stale token, not a revoked key.
- … apply (provider has no refresher: openrouter, openai, deepseek, generic OpenAI-compatible) or have already run without fixing the underlying 401.
- … context-overflow paths because those have `is_auth = False`.

Tests
`tests/agent/test_single_key_auth_fail_fast.py`, 9 tests covering:

- … `is_auth` still recognises the upgraded `auth_permanent` reason (so the abort path's actionable-hint block still fires).
- … fail-fast block, gated on `classified.is_auth and not recovered_with_pool`, so a future refactor cannot silently drop the fix back to the old max_retries-burning behaviour.
- … arm.
- … `status_code` / `provider` / `model` / `error_context` for the downstream diagnostic output.
- … message when the original error message was empty.
Verified the surrounding test suites are unaffected:
Test plan
- … key in `.env`, run `hermes -z "ping"` and confirm:
  - … once" line.
  - … not 2–8 minutes.
  - … OpenRouter credits link, etc.) still prints.
- … still succeeds on the fresh-connection retry, i.e. we don't regress on the genuinely transient case.
- … is unchanged: the fail-fast block is gated on `not recovered_with_pool` and stays out of the way.

🤖 Generated with Claude Code