Skip to content

[Bug] Codex Responses API soft failures (status=failed) bypass credential pool rotation #24159

@jmmaloney4

Description

@jmmaloney4

Summary

When the OpenAI Codex Responses API returns quota exhaustion as HTTP 200 with response.status = "failed", the runtime correctly detects the soft failure (via the response_invalid block added in #15104) but routes recovery exclusively through _try_activate_fallback() (cross-provider). The response_invalid block never calls _recover_with_credential_pool() or mark_exhausted_and_rotate(), so same-provider credential pool rotation is dead for this error path.

Environment

  • Hermes Agent v0.13.0 (2026.5.7)
  • Provider: openai-codex (3 OAuth accounts in credential pool, round_robin strategy)
  • Python 3.13, OpenAI SDK 2.31.0

Reproduction

  1. Configure multiple openai-codex OAuth accounts in the credential pool (hermes auth add openai-codex --type oauth --label <name>)
  2. Set credential_pool_strategies.openai-codex: round_robin (or any strategy)
  3. Drain the quota on the first pool entry (priority 0)
  4. Run any hermes session using openai-codex
  5. Observe: the runtime retries the same exhausted account 3 times with backoff, then falls through to cross-provider fallback (if configured) or gives up. It never rotates to the next pool entry.

Root cause

Two error-handling paths exist in run_agent.py:

Path 1: HTTP Exception Handler (pool rotation WORKS)

Lines ~13170+ in the except block around the API call loop.

When OpenAI SDK throws APIStatusError (HTTP 429, 402, etc.), the handler calls:

  1. classify_api_error()FailoverReason
  2. _recover_with_credential_pool(reason)mark_exhausted_and_rotate() → selects next pool entry
  3. Retries with new credentials

Path 2: Invalid Response Handler (pool rotation DOES NOT FIRE)

Lines ~12335-12600 in the response_invalid block.

When the Responses API returns HTTP 200 with response.status = "failed" (quota exhaustion), the handler:

  1. Detects _codex_resp_status in {"failed", "cancelled"} (correctly, added in fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX #15104)
  2. Sets response_invalid = True
  3. Calls _try_activate_fallback() — cross-provider only
  4. If no fallback: retries with exponential backoff on the same credentials
  5. After max_retries (default 3): gives up

Zero calls to _recover_with_credential_pool() or mark_exhausted_and_rotate() in the entire response_invalid block. The pool entry is never marked exhausted, so subsequent sessions may also select the same dead entry.

This grep confirms:

$ grep -c "_recover_with_credential_pool\|mark_exhausted" <lines 12335-12600>
0

Why the Responses API does this

The OpenAI Responses API returns quota exhaustion as a "soft failure":

  • HTTP 200 (success from SDK perspective)
  • response.status = "failed" (not "completed")
  • response.error contains quota/rate-limit message
  • response.output is empty

The OpenAI SDK does not throw an exception for this. The error is only detectable by inspecting response.status and response.error. This means Path 1 never fires for Codex quota exhaustion — only Path 2.

Proposed fix

In the response_invalid block for codex_responses, when response.status in {"failed", "cancelled"} and the error message contains quota/billing/rate-limit signals, call self._recover_with_credential_pool(reason) with an appropriate FailoverReason before falling through to _try_activate_fallback().

Pseudocode for the insertion point around line 12362:

# After setting response_invalid = True and error_details:
if _codex_resp_status in {"failed", "cancelled"}:
    # Classify the error for pool recovery
    error_reason = self._classify_codex_soft_failure(_codex_error_msg)
    if error_reason:
        recovered, _ = self._recover_with_credential_pool(error_reason)
        if recovered:
            retry_count = 0
            compression_attempts = 0
            primary_recovery_attempted = False
            continue  # retry with new credentials

The error classification logic from agent/error_classifier.py could be reused to detect quota signals from response.error.message and response.error.code. If the error is not quota-related (e.g. content policy), pool rotation should not fire.

Workaround

  1. Start a fresh session (/new or restart hermes) — the pool will select a different entry on cold start
  2. If all entries are exhausted, manually reset: hermes auth reset openai-codex
  3. If using fill_first, manually swap priority 0: hermes auth remove openai-codex 1 && hermes auth add openai-codex --type oauth --label <name>

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existsarea/authAuthentication, OAuth, credential poolscomp/agentCore agent loop, run_agent.py, prompt builderprovider/openaiOpenAI / Codex Responses APItype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions