fix(credential-pool): correctness + rotation + cross-process sync #15120

Merged
teknium1 merged 8 commits into main from hermes/hermes-172af8ae on Apr 24, 2026

@teknium1

Summary

Consolidated salvage of 7 credential-pool and auth-related PRs — pool strategy correctness, auth-failure recovery paths, UX, and cross-process OAuth sync. Attribution preserved via rebase-merge.

Changes

Credit

Validation

scripts/run_tests.sh tests/agent/test_credential_pool.py tests/agent/test_credential_pool_routing.py \
  tests/hermes_cli/test_auth_commands.py tests/hermes_cli/test_auth_codex_provider.py \
  tests/hermes_cli/test_auth_nous_provider.py tests/hermes_cli/test_status.py \
  tests/run_agent/test_provider_fallback.py tests/hermes_cli/test_overlay_slug_resolution.py \
  tests/run_agent/test_run_agent_codex_responses.py

208/208 passing.

Not included in this salvage

vominh1919 and others added 8 commits April 24, 2026 05:14
The least_used strategy selected entries via min(request_count) but
never incremented the counter. All entries stayed at count=0, so the
strategy degenerated to fill_first behavior with no actual load balancing.

Now increments request_count after each selection and persists the update.
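The fix described above can be sketched roughly as follows. This is a minimal illustration, not the Hermes implementation: `PoolEntry` and `select_least_used` are hypothetical names, and persistence of the updated counter is omitted.

```python
from dataclasses import dataclass


@dataclass
class PoolEntry:
    # Hypothetical stand-in for a credential-pool entry.
    label: str
    request_count: int = 0


def select_least_used(entries):
    """Pick the entry with the lowest request_count and record the use."""
    chosen = min(entries, key=lambda e: e.request_count)
    # Without this increment every count stays at 0 and min() keeps
    # returning the first entry, which is fill_first behavior.
    chosen.request_count += 1
    return chosen


entries = [PoolEntry("a"), PoolEntry("b"), PoolEntry("c")]
picked = [select_least_used(entries).label for _ in range(6)]
print(picked)  # ['a', 'b', 'c', 'a', 'b', 'c']
```

With the increment in place, selections spread evenly across the entries instead of always landing on the first one.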
Previously _handle_credential_pool_error handled 401, 402, and 429
but silently ignored 403. When a provider returns 403 for a revoked or
unauthorised credential (e.g. Nous agent_key invalidated by a newer
login), the pool was never rotated and every subsequent request
continued to use the same failing credential.

Treat 403 the same as 402: immediately mark the current credential
exhausted and rotate to the next pool entry, since a Forbidden response
will not resolve itself with a retry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
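A rough sketch of the 403 handling described in this commit, under stated assumptions: `TinyPool`, `handle_pool_error`, and `IMMEDIATE_ROTATE_STATUSES` are hypothetical names, and the 401 and 429 paths of the real handler are deliberately not modeled here.

```python
# Statuses that will not resolve with a retry: exhaust and rotate at once.
IMMEDIATE_ROTATE_STATUSES = frozenset({402, 403})


class TinyPool:
    """Hypothetical in-memory pool, just enough to show rotation."""

    def __init__(self, keys):
        self.keys = list(keys)
        self.exhausted = set()
        self.current = self.keys[0]

    def mark_exhausted_and_rotate(self):
        """Mark the current key exhausted and move to the next usable one."""
        self.exhausted.add(self.current)
        for key in self.keys:
            if key not in self.exhausted:
                self.current = key
                return key
        self.current = None
        return None


def handle_pool_error(pool, status_code):
    """Return True if the pool rotated to a usable credential."""
    if status_code in IMMEDIATE_ROTATE_STATUSES:
        return pool.mark_exhausted_and_rotate() is not None
    # Other statuses (including 401 and 429) are out of scope for this sketch.
    return False


pool = TinyPool(["key-1", "key-2"])
rotated = handle_pool_error(pool, 403)
```

Before the fix, a 403 would have fallen through to the final `return False` branch, leaving the failing credential in place.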
Extracts pool-rotation-room logic into `_pool_may_recover_from_rate_limit`
so single-credential pools no longer block the eager-fallback path on 429.

The existing check `pool is not None and pool.has_available()` lets
fallback fire only after the pool marks every entry as exhausted.  With
exactly one credential in the pool (the common shape for Gemini OAuth,
Vertex service accounts, and any personal-key setup), `has_available()`
flips back to True as soon as the cooldown expires — Hermes retries
against the same entry, hits the same daily-quota 429, and burns the
retry budget in a tight loop before ever reaching the configured
`fallback_model`.  Observed in the wild as 4+ hours of 429 noise on a
single Gemini key instead of falling through to Vertex as configured.

Rotation is only meaningful with more than one credential — gate on
`len(pool.entries()) > 1`.  Multi-credential pools keep the current
wait-for-rotation behaviour unchanged.

Fixes #11314.  Related to #8947, #10210, #7230.  Narrower scope than
open PRs #8023 (classifier change) and #11492 (503/529 credential-pool
bypass) — this addresses the single-credential 429 case specifically
and does not conflict with either.

Tests: 6 new unit tests in tests/run_agent/test_provider_fallback.py
covering (a) None pool, (b) single-cred available, (c) single-cred in
cooldown, (d) 2-cred available rotates, (e) multi-cred all cooling-down
falls back, (f) many-cred available rotates.  All 18 tests in the file
pass.
Concurrent Hermes processes (e.g. cron jobs) refreshing a Nous OAuth token
via resolve_nous_runtime_credentials() write the rotated tokens to auth.json.
The calling process's pool entry becomes stale, and the next refresh against
the already-rotated token triggers a 'refresh token reuse' revocation on
the Nous Portal.

_sync_nous_entry_from_auth_store() reads auth.json under the same lock used
by resolve_nous_runtime_credentials, and adopts the newer token pair before
refreshing the pool entry. This complements #15111 (which preserved the
obtained_at timestamps through seeding).
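The sync step might look roughly like the sketch below. Everything here is an assumption for illustration: the flat `auth.json` layout, the field names `access_token` and `refresh_token`, the "newer means a larger `obtained_at`" rule, and the use of `threading.Lock` as a stand-in for whatever cross-process lock `resolve_nous_runtime_credentials` actually takes.

```python
import json
import tempfile
import threading
from pathlib import Path

# Stand-in only: a real cross-process setup would need a file-based lock.
_AUTH_LOCK = threading.Lock()

_TOKEN_FIELDS = ("access_token", "refresh_token", "obtained_at")


def sync_entry_from_auth_store(entry, auth_path):
    """Adopt the stored token pair if it is newer; return True if adopted."""
    with _AUTH_LOCK:
        try:
            stored = json.loads(auth_path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return False
    if not isinstance(stored, dict) or not stored.get("refresh_token"):
        return False
    if stored["refresh_token"] == entry.get("refresh_token"):
        return False  # tokens already match: no-op
    if stored.get("obtained_at", 0) < entry.get("obtained_at", 0):
        return False  # the in-memory entry is the newer one
    for field in _TOKEN_FIELDS:
        entry[field] = stored.get(field)
    return True


with tempfile.TemporaryDirectory() as tmp:
    auth_path = Path(tmp) / "auth.json"
    auth_path.write_text(json.dumps(
        {"access_token": "new-access", "refresh_token": "new-refresh", "obtained_at": 200}
    ))
    entry = {"access_token": "old-access", "refresh_token": "old-refresh", "obtained_at": 100}
    first = sync_entry_from_auth_store(entry, auth_path)   # adopts the rotated pair
    second = sync_entry_from_auth_store(entry, auth_path)  # no-op, tokens now match
```

Running the sync before a refresh means the process refreshes with the token another process already rotated to, rather than replaying a stale refresh token.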

Partial salvage of #10160 by @konsisumer — only the agent/credential_pool.py
changes + the 3 Nous-specific regression tests. The PR also touched 10
unrelated files (Dockerfile, tips.py, various tool tests) which were
dropped as scope creep.

Regression tests:
- test_sync_nous_entry_from_auth_store_adopts_newer_tokens
- test_sync_nous_entry_noop_when_tokens_match
- test_nous_exhausted_entry_recovers_via_auth_store_sync

Labels

- area/auth: Authentication, OAuth, credential pools
- comp/agent: Core agent loop, run_agent.py, prompt builder
- comp/cli: CLI entry point, hermes_cli/, setup wizard
- P1: High (major feature broken, no workaround)
- type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

[Feature]: On an HTTP 529 error, Hermes should try switching to the fallback model

9 participants