fix(agent): reset fallback index when activation never succeeded (#16677) #18156

Closed
season179 wants to merge 1 commit into NousResearch:main from season179:fix/fallback-index-leak-16677

Conversation

@season179

Contributor

Closes #16677.

_try_activate_fallback() increments _fallback_index before activation can fail. When every entry fails to resolve (exhausted credential pool, unconfigured provider, etc.), the recursion walks the index to len(chain) without ever flipping _fallback_activated. Before this patch, _restore_primary_runtime() then returned early without resetting the index, so the next turn's _try_activate_fallback() short-circuited at the bounds check: the user saw ⚠️ trying fallback... followed by an immediate abort, with no swap on the wire.
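The failure mode described above can be modelled in a few lines. This is a minimal sketch, not the code in run_agent.py: the class name AgentModel, the attribute _fallback_chain, the injected resolve callable, and the reset in the activated branch are all assumptions made for illustration. Only _fallback_index, _fallback_activated, _try_activate_fallback() and _restore_primary_runtime() are names taken from the description.

```python
# Hypothetical stand-in for the agent. It mirrors only the behaviour the PR
# description states; the real methods in run_agent.py do much more.
class AgentModel:
    def __init__(self, chain, resolve):
        self._fallback_chain = chain      # assumed attribute name
        self._resolve = resolve           # stand-in for resolve_provider_client
        self._fallback_index = 0
        self._fallback_activated = False

    def _try_activate_fallback(self):
        if self._fallback_index >= len(self._fallback_chain):
            return False                  # bounds check: chain looks exhausted
        entry = self._fallback_chain[self._fallback_index]
        self._fallback_index += 1         # advanced BEFORE activation can fail
        client, _model = self._resolve(entry)
        if client is None:
            return self._try_activate_fallback()  # recurse to the next entry
        self._fallback_activated = True
        return True

    def _restore_primary_runtime(self):
        if not self._fallback_activated:
            return                        # pre-patch: index is left untouched
        self._fallback_index = 0          # assumed behaviour after a real swap
        self._fallback_activated = False


# Every entry fails to resolve, as with an exhausted credential pool.
agent = AgentModel(["provider-a", "provider-b"], resolve=lambda entry: (None, None))
turn_1 = agent._try_activate_fallback()   # walks the index to len(chain)
agent._restore_primary_runtime()          # early return, no reset
turn_2 = agent._try_activate_fallback()   # short-circuits at the bounds check
```

In this model the second call never reaches the resolver at all, which matches the "immediate abort, with no swap on the wire" symptom.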

The fix is one line: set _fallback_index = 0 in the early-return branch so the chain can be attempted again on the next turn.
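As a rough illustration of where that line goes, here is a sketch of the patched early-return branch. It is written as a free function over a plain namespace object rather than the real method; the function name restore_primary_runtime, the boolean return value, and the body of the activated branch are assumptions, not the project's actual code.

```python
from types import SimpleNamespace


def restore_primary_runtime(agent):
    """Hypothetical model of _restore_primary_runtime() with the patch applied."""
    if not agent._fallback_activated:
        # The one-line fix: a chain that was walked to the end without ever
        # activating must be attempt-able again on the next turn.
        agent._fallback_index = 0
        return False                      # nothing to restore (assumed return value)
    # The real method presumably swaps the primary runtime back in here; elided.
    agent._fallback_index = 0
    agent._fallback_activated = False
    return True


# State left behind by a turn whose single fallback entry failed to resolve.
state = SimpleNamespace(_fallback_index=1, _fallback_activated=False)
restored = restore_primary_runtime(state)
```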

I reproduced the bug deterministically against the real run_agent.py code paths (single-entry chain, mocked resolve_provider_client returning (None, None) to simulate the exhausted-pool case): turn 1 leaves _fallback_index = 1 / _fallback_activated = False, the next turn's _restore_primary_runtime() doesn't reset, and even a now-resolvable fallback short-circuits at the bounds check. With the patch, the same script swaps to the fallback on turn 2 as expected.
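The two-turn reproduction can be approximated without the real agent. The sketch below is an assumption-laden model: it uses a loop where the real code recurses, keeps state in a dict, and takes a hypothetical patched flag to switch the reset on or off. The mock stands in for resolve_provider_client, returning (None, None) on turn 1 and a resolvable pair on turn 2.

```python
from unittest.mock import Mock


def run_two_turns(patched):
    """Return (turn-1 result, turn-2 result, resolver call count). Model only."""
    chain = ["fallback-provider"]         # single-entry chain, as in the repro
    # Turn 1: exhausted pool -> (None, None). Turn 2: the fallback now resolves.
    resolve = Mock(side_effect=[(None, None), ("client", "model")])
    state = {"index": 0, "activated": False}

    def try_activate_fallback():
        while state["index"] < len(chain):
            entry = chain[state["index"]]
            state["index"] += 1           # advanced before resolution can fail
            client, _model = resolve(entry)
            if client is not None:
                state["activated"] = True
                return True
        return False                      # bounds check reached

    def restore_primary_runtime():
        if not state["activated"]:
            if patched:
                state["index"] = 0        # the reset added by this PR
            return
        state["index"] = 0
        state["activated"] = False

    first = try_activate_fallback()       # turn 1: resolution fails
    restore_primary_runtime()             # start of turn 2
    second = try_activate_fallback()      # turn 2: resolvable only if index was reset
    return first, second, resolve.call_count
```

Calling run_two_turns(False) leaves the resolver untouched on turn 2, while run_two_turns(True) reaches it a second time and activates the fallback.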

Tested:

  • New regression test test_resets_fallback_index_after_failed_activation in tests/run_agent/test_primary_runtime_restore.py. Fails without the patch (third assertion), passes with it.
  • Full tests/run_agent/ suite: 1182 passed, 9 skipped, no regressions.

Note: this only fixes the credential-pool-exhausted / unconfigured-fallback class of #16677. The reporter also describes a separate auxiliary.vision.provider: auto resolving to a 16K-context model — that path doesn't apply the MINIMUM_CONTEXT_LENGTH floor and deserves its own ticket. Their status=75/TEMPFAIL "crash loop" framing is downstream of this same fallback-never-fires bug (75 is the intentional graceful-restart code in gateway/restart.py, not a 429 exit).

…sResearch#16677)

_try_activate_fallback() advances _fallback_index before activation can
fail. If every entry fails to resolve (exhausted credential pool,
unconfigured provider) the recursion walks the index to len(chain)
without ever flipping _fallback_activated. _restore_primary_runtime()
then early-returned without resetting the index, so subsequent fallback
attempts short-circuited at the bounds check on line 7438 -- the caller
emitted "trying fallback..." then aborted immediately.

Reset _fallback_index in the early-return branch so the chain stays
attempt-able across turns.
alt-glitch added the type/bug, P1, and comp/agent labels on May 1, 2026
@alt-glitch

Collaborator

Likely duplicate of #17824. Same root cause: _fallback_index is not reset in _restore_primary_runtime() when the previous turn exhausted the chain without activating a fallback.

@season179

Contributor Author

Closing as a duplicate of #17824. Apologies for the noise.

Development

Successfully merging this pull request may close these issues.

DeepSeek V4 Pro via OpenRouter causes gateway crash loop and Telegram bot failure
