Credential pool rotation should not count toward api_max_retries #16830

@aiagent0619

Issue Description

When using a credential pool with multiple API keys under a single provider, the current retry logic causes Hermes to give up after trying only 2-3 keys, even when more keys are available in the pool.

Current Behavior

With the default api_max_retries: 3 and a pool of 10 API keys:

  1. API key #1 gets 429 → has_retried_429 set to True (retry_count = 1)
  2. API key #1 gets 429 again → rotates to key #2, has_retried_429 resets (retry_count = 2)
  3. API key #2 gets 429 → has_retried_429 set to True (retry_count = 3)
  4. Agent stops because retry_count >= max_retries, even though keys #3-#10 were never tried
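
The walkthrough above can be replayed with a small simulation. This is only a sketch of the behavior as described in this issue, not the actual run_agent.py loop: the function name simulate_current_behavior is hypothetical, and it assumes every request returns 429.

```python
def simulate_current_behavior(num_keys: int, max_retries: int = 3) -> list[int]:
    """Return the 1-based key numbers tried before the agent gives up.

    Assumption: every request returns 429, as in the walkthrough.
    """
    tried = [1]
    current_key = 1
    has_retried_429 = False
    retry_count = 0

    while retry_count < max_retries:
        # Every attempt consumes a retry, whether or not a rotation follows.
        retry_count += 1
        if not has_retried_429:
            # First 429 on this key: retry the same credential.
            has_retried_429 = True
        elif current_key < num_keys:
            # Second 429 on this key: rotate to the next one.
            current_key += 1
            tried.append(current_key)
            has_retried_429 = False
        else:
            # No keys left to rotate to.
            break
    return tried


print(simulate_current_behavior(num_keys=10))  # [1, 2]
```

With the default of 3 retries, the retry budget runs out while key #2 is still on its first 429, so the remaining eight keys are never reached.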

The relevant code is in run_agent.py:

# Line 5809-5822: _recover_with_credential_pool
if effective_reason == FailoverReason.rate_limit:
    if not has_retried_429:
        return False, True  # First 429: retry same credential
    rotate_status = status_code if status_code is not None else 429
    next_entry = pool.mark_exhausted_and_rotate(...)
    if next_entry is not None:
        self._swap_credential(next_entry)
        return True, False  # Rotated successfully

# Line 11315: retry_count is incremented regardless of rotation
retry_count += 1

Expected Behavior

Credential pool rotation should be transparent to the retry counter:

  1. Rotating to a new API key in the pool should not increment retry_count
  2. Only when all keys in the pool have been exhausted should it count as one retry attempt
  3. This allows the agent to fully utilize the credential pool before giving up

Proposed Solution

Option A (Preferred): Track pool exhaustion separately from API retries

  • Add a flag like pool_fully_exhausted that becomes True only when mark_exhausted_and_rotate() returns None
  • Only increment retry_count when pool_fully_exhausted is True
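
A minimal sketch of Option A, under stated assumptions: FakePool and run_with_pool are hypothetical stand-ins rather than Hermes code, mark_exhausted_and_rotate() is simplified to take no arguments, and only the 429 path is modeled (other error types would still consume retries as they do today).

```python
from typing import Callable, List, Optional, Tuple


class FakePool:
    """Hypothetical stand-in for the credential pool; not the Hermes class."""

    def __init__(self, keys: List[str]) -> None:
        self._keys = list(keys)
        self._index = 0

    def current(self) -> str:
        return self._keys[self._index]

    def mark_exhausted_and_rotate(self) -> Optional[str]:
        """Move to the next key, or return None when no keys are left."""
        if self._index + 1 >= len(self._keys):
            return None
        self._index += 1
        return self._keys[self._index]


def run_with_pool(pool: FakePool, call_api: Callable[[str], int],
                  max_retries: int = 3) -> Tuple[bool, List[str]]:
    """Return (succeeded, keys_tried); only pool exhaustion costs a retry."""
    retry_count = 0
    has_retried_429 = False
    keys_tried = [pool.current()]
    while retry_count < max_retries:
        if call_api(pool.current()) != 429:
            return True, keys_tried
        if not has_retried_429:
            has_retried_429 = True  # first 429: retry the same credential
            continue
        next_key = pool.mark_exhausted_and_rotate()
        pool_fully_exhausted = next_key is None
        if pool_fully_exhausted:
            retry_count += 1  # the whole pool is spent: count one retry
        else:
            keys_tried.append(next_key)
            has_retried_429 = False
    return False, keys_tried


# Usage: only the seventh key is not rate limited.
pool = FakePool([f"k{i}" for i in range(1, 11)])
ok, tried = run_with_pool(pool, lambda key: 200 if key == "k7" else 429)
print(ok, tried)
```

In this sketch the loop still terminates when every key is rate limited, because each pass over the exhausted pool consumes one retry until max_retries is reached.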

Option B: Skip the retry_count increment on successful rotation

  • When _recover_with_credential_pool returns True (successfully rotated), don't increment retry_count
  • Only increment when recovery fails (no more keys in pool)
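
One detail of Option B is worth checking against the snippet quoted earlier: there, the first 429 on a key returns (False, True), which also looks like a failed recovery. The toy model below assumes the first tuple element means "rotated" and that every request returns 429; recover and keys_tried_option_b are hypothetical names, not Hermes functions. It compares counting that same-key retry against treating it as free.

```python
def recover(has_retried_429: bool, keys_left: int) -> tuple[bool, bool]:
    """Toy model of the recovery step: returns (rotated, has_retried_429)."""
    if not has_retried_429:
        return False, True   # first 429: retry the same key
    if keys_left > 0:
        return True, False   # second 429: rotate to the next key
    return False, True       # pool exhausted: nothing to rotate to


def keys_tried_option_b(num_keys: int, max_retries: int = 3,
                        count_same_key_retry: bool = True) -> int:
    """Count keys tried when every request returns 429 (assumed worst case)."""
    retry_count = 0
    key = 1
    has_retried_429 = False
    while retry_count < max_retries:
        was_first_429 = not has_retried_429
        rotated, has_retried_429 = recover(has_retried_429, num_keys - key)
        if rotated:
            key += 1          # Option B: rotation leaves retry_count alone
        elif was_first_429 and not count_same_key_retry:
            pass              # variant: the same-key retry is free as well
        else:
            retry_count += 1
    return key


print(keys_tried_option_b(10))                              # 3
print(keys_tried_option_b(10, count_same_key_retry=False))  # 10
```

Under these assumptions, skipping the increment only on rotation still spends one retry per key, so the full pool is reached only if the same-credential retry is exempt too, or if exhaustion is tracked explicitly as in Option A.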

Use Case

Users with multiple API keys from the same provider (e.g., 10 keys with individual rate limits) expect Hermes to cycle through all available keys before reporting a rate limit error. With the current behavior and a 10-key pool, only 2-3 keys are tried, so roughly 70-80% of the available capacity goes unused.

Environment

  • Hermes Agent version: Latest (as of April 2026)
  • Provider: Any provider with credential pool support
  • Config: agent.api_max_retries: 3 (default), credential pool with 10+ keys

Metadata

Labels

  • P2: Medium, degraded but workaround exists
  • area/auth: Authentication, OAuth, credential pools
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • type/bug: Something isn't working