Skip to content

fix(codex): credential pool rotation on Responses API soft failures#24173

Open
jmmaloney4 wants to merge 1 commit into
NousResearch:mainfrom
jmmaloney4:fix/codex-soft-failure-pool-rotation
Open

fix(codex): credential pool rotation on Responses API soft failures#24173
jmmaloney4 wants to merge 1 commit into
NousResearch:mainfrom
jmmaloney4:fix/codex-soft-failure-pool-rotation

Conversation

@jmmaloney4

Copy link
Copy Markdown

Summary

When the Codex Responses API returns HTTP 200 with response.status = "failed" or "cancelled" (e.g. quota exhaustion), the runtime correctly detects the soft failure but only attempts cross-provider fallback via _try_activate_fallback(). Same-provider credential pool rotation is never invoked, so the agent stays on one exhausted account and eventually falls through to a different provider entirely.

Root cause: The response_invalid block (added in PR #15104) detects codex soft failures but never calls _recover_with_credential_pool(). The exception handler path (429/402) does call pool recovery, but these soft failures bypass the exception handler entirely — the HTTP response is 200, not an error code.

Changes

  • New method _classify_codex_soft_failure() on AIAgent — pattern-matches the response.error message against billing and rate-limit signals, returning a FailoverReason or None for non-quota errors.
  • Early pool-recovery block in the response_invalid path for codex_responses mode: classify the error, attempt credential rotation, and continue on the new credential before falling through to cross-provider fallback.
  • Billing signals checked first: insufficient, credits have been exhausted, billing, payment required, exceeded your current quota, plan does not include, account is deactivated, usage limit, credit balance, top up your credits, quota.
  • Rate-limit signals checked second: rate limit, rate_limit, too many requests, throttl, try again, retry, slow down, quota exceeded, requests per, tokens per.
  • Non-quota soft failures (content policy, safety, cancelled by user) return None and do not trigger pool rotation.

Test plan

  • 25 new tests in tests/agent/test_codex_soft_failure_pool_rotation.py:
    • 8 parametrized billing signal tests
    • 6 parametrized rate-limit signal tests
    • 4 non-quota error tests (content policy, safety, cancelled, empty)
    • 1 billing-priority-over-rate-limit test
    • 6 integration tests (pool rotation on billing, rate-limit, no-pool skip, content-policy skip, non-codex skip, pool exhaustion fallback)
  • All 51 existing credential pool tests still pass
  • pytest tests/agent/test_credential_pool.py tests/agent/test_credential_pool_routing.py tests/agent/test_codex_soft_failure_pool_rotation.py — 76/76 passed

Risks

  • Low risk: Changes are additive — no existing code paths are modified, only new recovery logic inserted before the existing fallback path.
  • Pattern matching is best-effort: Error messages from the Codex Responses API are free-form strings. The pattern lists cover known error phrasing but may need extension as new error messages are observed. The fallback path (cross-provider) still fires if pool recovery fails or returns None.
  • Billing vs rate-limit ambiguity: Some messages (e.g. "exceeded your quota") could be either billing or rate-limit. The classifier errs on the side of billing (immediate rotation, no backoff wait) since quota exhaustion is more severe than temporary throttling. This matches the behavior users expect when they have multiple paid plans.

Related

@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder provider/openai OpenAI / Codex Responses API area/auth Authentication, OAuth, credential pools P2 Medium — degraded but workaround exists labels May 12, 2026
When the Codex Responses API returns HTTP 200 with response.status =
"failed" or "cancelled" (e.g. quota exhaustion), the runtime correctly
detects the soft failure but only attempts cross-provider fallback via
_try_activate_fallback().  Same-provider credential pool rotation is
never invoked, so the agent burns through all pool entries on one
exhausted account before falling through to a different provider.

Root cause: the response_invalid block (PR NousResearch#15104) added detection for
codex soft failures but did not wire in _recover_with_credential_pool().
The exception handler path (429/402) does call pool recovery, but these
soft failures bypass the exception handler entirely.

Changes:
- Add _classify_codex_soft_failure() method to AIAgent that pattern-
  matches the error message from response.error against billing and
  rate-limit signals, returning a FailoverReason or None.
- Add early pool-recovery block in the response_invalid path for
  codex_responses mode: classify the error, attempt rotation, and
  continue on the new credential before falling through to cross-
  provider fallback.
- Billing signals are checked first (insufficient, credits, quota,
  billing, payment required, usage limit, account deactivated).
- Rate-limit signals checked second (rate limit, throttle, retry,
  too many requests, quota exceeded, requests/tokens per).
- Non-quota soft failures (content policy, safety, cancelled) return
  None and do not trigger pool rotation.

Fixes NousResearch#24159
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/auth Authentication, OAuth, credential pools comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists provider/openai OpenAI / Codex Responses API type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Codex Responses API soft failures (status=failed) bypass credential pool rotation

2 participants