fix: 24h cooldown for 401/403 auth failures + user notification #10058
Closed
teknium1 wants to merge 1 commit into
Previously, credentials exhausted due to 401 (invalid token) or 403 (forbidden) used the same 1-hour cooldown as 429 rate limits. This meant the system would retry an invalid token every hour forever, burning API calls and confusing users who had no idea why their primary provider wasn't being used.

Changes:
- credential_pool: EXHAUSTED_TTL_AUTH_SECONDS = 24h for 401/403 errors (rate limits keep the 1h cooldown; provider reset_at timestamps still override both)
- run_agent: emit an actionable status message via _emit_status() when all pool credentials are rejected. It tells the user to run `hermes auth reset <provider>` or `hermes model` to re-authenticate. The message propagates to both CLI (force-printed) and gateway (Telegram, Discord, etc.)
- Tests for all three TTL cases (401 stays exhausted at 1h, 401 resets at 24h, 403 stays exhausted at 1h) and for the auth exhaustion notification (emits when the pool is exhausted, silent when rotation succeeds)

Addresses user report: Copilot 401 + Codex 429 caused silent fallback with no recovery path visible to the user.
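The cooldown rule described above can be sketched roughly as follows. This is an illustration only, not the code in `agent/credential_pool.py`: `EXHAUSTED_TTL_AUTH_SECONDS` is the constant named in this PR, while `EXHAUSTED_TTL_DEFAULT_SECONDS`, `exhausted_ttl_for`, and `is_still_exhausted` are hypothetical names, and treating `reset_at` as an absolute Unix timestamp is an assumption.

```python
from typing import Optional

EXHAUSTED_TTL_DEFAULT_SECONDS = 60 * 60      # hypothetical name: 1h cooldown (e.g. 429)
EXHAUSTED_TTL_AUTH_SECONDS = 24 * 60 * 60    # constant named in the PR: 24h for 401/403

AUTH_FAILURE_CODES = {401, 403}


def exhausted_ttl_for(status_code: Optional[int]) -> int:
    """Return the cooldown, in seconds, for a credential that failed with status_code."""
    if status_code in AUTH_FAILURE_CODES:
        return EXHAUSTED_TTL_AUTH_SECONDS
    return EXHAUSTED_TTL_DEFAULT_SECONDS


def is_still_exhausted(
    status_code: Optional[int],
    exhausted_at: float,
    now: float,
    reset_at: Optional[float] = None,
) -> bool:
    """Return True while the credential should be skipped.

    Assumption: a provider-supplied reset_at (Unix seconds) overrides both TTLs.
    """
    if reset_at is not None:
        return now < reset_at
    return now < exhausted_at + exhausted_ttl_for(status_code)


if __name__ == "__main__":
    t0 = 1_000_000.0
    one_hour, one_day = 3600, 86400
    print(is_still_exhausted(401, t0, t0 + one_hour))  # 401 is still cooling down at 1h
    print(is_still_exhausted(401, t0, t0 + one_day))   # 401 becomes eligible again at 24h
    print(is_still_exhausted(429, t0, t0 + one_hour))  # 429 keeps the 1h cooldown
```

Passing `now` explicitly rather than reading the clock inside the function keeps the boundary cases (1h and 24h) easy to test without patching time.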
Contributor
Author
Closing: keeping the existing 1h cooldown as-is.
aangelinsf pushed a commit to aangelinsf/hermes-agent that referenced this pull request (Apr 15, 2026)
…th-exhausted

When all credentials for a provider are exhausted due to 401/403 failures, emit a plain-language _emit_status() notification so gateway users (Telegram, Discord, etc.) know their primary AI has become unavailable and what to do. Same-provider key rotation remains silent: the message only fires when rotation itself fails and Hermes is forced to fall back.

This is distinct from the cooldown duration change in PR NousResearch#10058 (which was closed). The notification half of that fix stands on its own: the configured fallback_model path already calls _emit_status() on provider switch, so this makes the credential pool exhaustion path consistent with that behavior.

Closes NousResearch#10476
Summary
Addresses a user-reported UX issue where failing credentials (Copilot returning 401 for an invalid token, Codex returning 429 for a rate limit) caused a silent fallback with no recovery path visible to the user.
Two changes:
1. 401/403 auth failures now have a 24-hour cooldown instead of 1 hour
Previously, credentials exhausted due to 401 (invalid token) or 403 (forbidden) used the same 1-hour cooldown as 429 rate limits. The system would retry the same dead token every hour, fail immediately, and re-exhaust, in an endless cycle. Now:
- 401/403 auth failures use a 24-hour cooldown
- 429 rate limits keep the 1-hour cooldown
- Provider reset_at timestamps still override both

2. User-facing notification when all credentials are rejected
When all pool credentials for a provider are rejected with 401/403 and the system falls back, it now emits an actionable message via `_emit_status()` telling the user to run `hermes auth reset <provider>` or `hermes model` to re-authenticate. The message propagates to both CLI (force-printed regardless of quiet mode) and gateway (Telegram, Discord, etc. via status_callback).
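A minimal sketch of the notification rule, assuming a pool that can report whether rotation found another usable credential. `_emit_status()` and `_recover_with_credential_pool()` are the names used in this PR; `FakePool`, `recover_after_auth_failure`, the callback signature, and the exact message wording below are hypothetical stand-ins rather than the actual implementation or text.

```python
from typing import Callable, List, Optional


class FakePool:
    """Hypothetical stand-in for the credential pool of one provider."""

    def __init__(self, provider: str, credentials: List[str]) -> None:
        self.provider = provider
        self._available = list(credentials)

    def mark_exhausted_and_rotate(self, failed: str) -> Optional[str]:
        """Drop the failed credential; return the next one, or None if none remain."""
        if failed in self._available:
            self._available.remove(failed)
        return self._available[0] if self._available else None


def recover_after_auth_failure(
    pool: FakePool,
    failed: str,
    emit_status: Callable[[str], None],
) -> Optional[str]:
    """Rotate silently when possible; notify only when the whole pool is rejected."""
    replacement = pool.mark_exhausted_and_rotate(failed)
    if replacement is not None:
        return replacement  # same-provider rotation succeeded: stay silent
    # Illustrative wording only; the commands are the ones named in the PR.
    emit_status(
        f"All credentials for {pool.provider} were rejected (401/403). "
        f"Run `hermes auth reset {pool.provider}` or `hermes model` to re-authenticate."
    )
    return None


if __name__ == "__main__":
    messages: List[str] = []
    pool = FakePool("copilot", ["key-a", "key-b"])
    recover_after_auth_failure(pool, "key-a", messages.append)  # rotates, no message
    recover_after_auth_failure(pool, "key-b", messages.append)  # pool empty, one message
    print(messages)
```

Taking the emitter as a callback mirrors the description above, where one status message reaches both the CLI and the gateway; here a list's `append` stands in for both sinks.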
Files changed
- agent/credential_pool.py: new EXHAUSTED_TTL_AUTH_SECONDS constant, updated _exhausted_ttl()
- run_agent.py: notification in the _recover_with_credential_pool() auth path
- tests/agent/test_credential_pool.py: 3 new TTL tests (401 stays at 1h, 401 resets at 24h, 403 stays at 1h)
- tests/agent/test_credential_pool_routing.py: 2 new notification tests (emits on pool exhaustion, silent on rotation)

Test results
All 5 new tests pass. 0 new failures introduced (6 pre-existing failures confirmed on unmodified main).