fix(agent): reset _fallback_index at turn start even when no fallback activated #27185

Merged: teknium1 merged 1 commit into main from hermes/hermes-ec72c975 on May 17, 2026
@teknium1

Contributor

Salvage of #20793 onto current main. Preserves @konsisumer's authorship.

Summary

Interactive CLI sessions now honor fallback_providers on Codex 429 usage_limit_reached (and all other failures), matching cron behavior.

Root cause: _try_activate_fallback() increments _fallback_index BEFORE resolving the provider's client. When every chain entry's resolver returns None (or raises), the recursive walk advances _fallback_index to >= len(_fallback_chain) but never sets _fallback_activated = True. On the next turn, _restore_primary_runtime() returns early because _fallback_activated is False, so the chain index stays exhausted for the rest of the session. The eager-fallback check at the top of the retry loop sees the exhausted index and silently skips: no "trying fallback" log line, no status message, just 3 retries and "API call failed after 3 retries."

Cron jobs work because each cron run constructs a fresh AIAgent with _fallback_index = 0.
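
The failure mode described above can be reproduced with a small model of the bookkeeping. This is only a sketch, not the code in run_agent.py: `ToyAgent`, `resolve_client`, `run_turn`, and `attempts` are hypothetical names introduced for illustration, and only the three attributes named in the root-cause description (`_fallback_chain`, `_fallback_index`, `_fallback_activated`) are taken from the PR. Resolver exceptions are not modeled; a failed resolution is represented by `None`.

```python
class ToyAgent:
    """Hypothetical stand-in for AIAgent; models only the fallback state."""

    def __init__(self, chain, resolve_client):
        self._fallback_chain = list(chain)
        self._fallback_index = 0
        self._fallback_activated = False
        self._resolve_client = resolve_client  # hypothetical resolver hook
        self.attempts = []  # records which chain entries were tried

    def _try_activate_fallback(self):
        if self._fallback_index >= len(self._fallback_chain):
            return False
        entry = self._fallback_chain[self._fallback_index]
        self._fallback_index += 1  # advanced BEFORE the client is resolved
        self.attempts.append(entry)
        if self._resolve_client(entry) is None:
            return self._try_activate_fallback()  # recursive walk down the chain
        self._fallback_activated = True
        return True

    def _restore_primary_runtime(self):
        if not self._fallback_activated:
            return False  # pre-fix behavior: the index is left untouched
        self._fallback_activated = False
        self._fallback_index = 0
        return True

    def run_turn(self):
        """One simulated turn in which the primary provider fails."""
        self._restore_primary_runtime()
        # Eager-fallback guard: skipped silently once the index is exhausted.
        if self._fallback_index < len(self._fallback_chain):
            return self._try_activate_fallback()
        return False


# Every resolver fails, as in the reported scenario.
agent = ToyAgent(["provider_a", "provider_b"], lambda entry: None)
agent.run_turn()   # walks the whole chain and leaves the index exhausted
agent.run_turn()   # guard fails silently; no entry is tried again
```

In this model the second `run_turn()` never reaches `_try_activate_fallback()`, which mirrors the silent skip in the interactive session, while a newly constructed `ToyAgent` starts at index 0 just as a fresh cron-run agent does.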

Changes

  • run_agent.py: add self._fallback_index = 0 in the not _fallback_activated early-return branch of _restore_primary_runtime().
  • tests/run_agent/test_primary_runtime_restore.py: regression test exercising the failed-activation path.
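
The two changes listed above could look roughly like the following. This is a hedged sketch under stated assumptions, not the merged diff or the actual test file: `PatchedToyAgent`, `resolve_client`, the return values of `_restore_primary_runtime()`, and the test function name are all hypothetical. Only the placement of `self._fallback_index = 0` in the `not _fallback_activated` branch comes from the PR description.

```python
class PatchedToyAgent:
    """Hypothetical model of the fallback state with the fix applied."""

    def __init__(self, chain, resolve_client):
        self._fallback_chain = list(chain)
        self._fallback_index = 0
        self._fallback_activated = False
        self._resolve_client = resolve_client  # hypothetical resolver hook

    def _try_activate_fallback(self):
        # Iterative equivalent of the recursive walk described in the PR.
        while self._fallback_index < len(self._fallback_chain):
            entry = self._fallback_chain[self._fallback_index]
            self._fallback_index += 1  # still advanced before resolution
            if self._resolve_client(entry) is not None:
                self._fallback_activated = True
                return True
        return False

    def _restore_primary_runtime(self):
        if not self._fallback_activated:
            self._fallback_index = 0  # the fix: reset even on early return
            return False
        self._fallback_activated = False
        self._fallback_index = 0
        return True


def test_index_reset_after_failed_activation():
    """Regression sketch: a failed activation must not poison later turns."""
    available = set()  # providers whose client can currently be resolved
    agent = PatchedToyAgent(
        ["provider_a", "provider_b"],
        lambda entry: object() if entry in available else None,
    )

    # Turn 1: every resolver fails, so the index is exhausted.
    assert agent._try_activate_fallback() is False
    assert agent._fallback_index == 2
    assert agent._fallback_activated is False

    # Turn 2 start: the early-return branch now resets the index.
    assert agent._restore_primary_runtime() is False
    assert agent._fallback_index == 0

    # Once a provider becomes resolvable, the full chain is walked again.
    available.add("provider_b")
    assert agent._try_activate_fallback() is True
    assert agent._fallback_activated is True
```

The test drives the failed-activation path directly rather than through a full turn, so it fails against the pre-fix early return (the index would stay at 2) and passes once the reset is in place.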

Validation

  • scripts/run_tests.sh tests/run_agent/test_primary_runtime_restore.py tests/run_agent/test_provider_fallback.py -q → 54/54 passed.

Closes #20465. Original PR #20793.

… activated

In long-lived interactive sessions, _try_activate_fallback() advances
_fallback_index before attempting client resolution.  When resolution
fails (provider not configured, etc.) the function returns False without
ever setting _fallback_activated=True.  _restore_primary_runtime() then
skips its reset block entirely (guarded by `if not _fallback_activated`),
leaving _fallback_index >= len(_fallback_chain) for all subsequent turns.
The eager-fallback guard at the top of the retry loop checks
`_fallback_index < len(_fallback_chain)`, so the condition fails silently
and no fallback is ever attempted again for that session.

Cron jobs spawn a fresh AIAgent per run and never hit this path, which is
why the same fallback chain works reliably for cron but not interactive.

Fix: reset _fallback_index=0 in the `not _fallback_activated` early-return
branch so every new turn starts with the full chain available.

Fixes #20465
@github-actions

Contributor

🔎 Lint report: hermes/hermes-ec72c975 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8334 on HEAD, 8334 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4359 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@teknium1 teknium1 merged commit 33528b4 into main May 17, 2026
17 of 18 checks passed
@teknium1 teknium1 deleted the hermes/hermes-ec72c975 branch May 17, 2026 00:12
@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) on May 17, 2026
Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • P2: Medium, degraded but workaround exists
  • type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

[Bug] Interactive CLI session does not auto-fallback on Codex 429 'usage_limit_reached', while cron jobs with the same fallback chain do
