fix(agent): consolidated fallback/retry correctness — SSL retry, cooldown, config context, auth fallback parity by teknium1 · Pull Request #15134 · NousResearch/hermes-agent

teknium1 · 2026-04-24T12:35:37Z

Summary

Consolidated salvage of 5 retry / fallback-chain correctness PRs touching the agent loop. Attribution preserved via rebase-merge. All changes preserve prompt-cache integrity (no context mutation, no mid-session reloads).

Changes

run_agent.py (SSL retry): ssl.SSLCertVerificationError, a subclass of ssl.SSLError, also inherits from ValueError, so the is_local_validation_error = isinstance(api_error, (ValueError, TypeError)) check misclassified TLS certificate failures as a "local programming bug" and aborted without retrying. Added ssl.SSLError to the exclusion tuple alongside UnicodeEncodeError and json.JSONDecodeError (fix(agent): retry on json.JSONDecodeError instead of treating it as a local validation error #15107). From @Bartok9 (fix(agent): exclude ssl.SSLError from is_local_validation_error to prevent non-retryable abort #14445).
run_agent.py (fallback config context) — _try_activate_fallback() passed base_url/api_key/provider into get_model_context_length() but dropped config_context_length, so explicit model.context_length overrides in config.yaml were ignored once the fallback activated (falling back to 128K even when config said 204800). From @CruxExperts (fix(agent): pass config_context_length in fallback activation path #14727).
run_agent.py (switch_model fallback state guard) — switch_model() wrote self._fallback_chain = [pruned] only inside the provider-mismatch if branch. When called on a partially-constructed agent (via AIAgent.__new__ in tests) or when old_norm == new_norm, it left _fallback_chain unset and crashed on the next access. Now always initializes via getattr(self, "_fallback_chain", []) or [] and assigns back unconditionally. From @LeonSGP43 (fix: default missing fallback state in switch_model #14867).
agent/error_classifier.py + run_agent.py (429 + cooldown) — Two-part fix:
1. classify_api_error() force-sets status_code=429 when the error type is RateLimitError but .status_code is missing (Copilot / GitHub Models SDK quirk), so downstream rate-limit handling fires.
2. _try_activate_fallback(reason=...) sets self._rate_limited_until = time.monotonic() + 60 only when leaving the primary provider (not when chain-advancing between fallbacks). _restore_primary_runtime() now respects the cooldown and stays on the working fallback until it expires, preventing the flip-flop retry loop that burned the budget on exhausted primaries. From @vlwkaos (fix(agent): force 429 classification for RateLimitError + rate-limit cooldown on primary restore #8023).
gateway/run.py + cron/scheduler.py (auth fallback parity) — Mirrors the CLI fallback-on-AuthError path (fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX #15104 / @A-FdL-Prog's fix(codex): route auth failures to fallback provider chain #5948) for the gateway and cron contexts. When resolve_runtime_provider() raises AuthError at session/job startup, walk the fallback_providers chain and switch to the first working provider instead of failing the message/job. From @Tranquil-Flow (fix(gateway,cron): activate fallback_model when primary provider auth fails #7432).
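The force-429 step from the two-part rate-limit fix above can be sketched roughly as follows. This is an illustration under assumptions, not the code in agent/error_classifier.py: the Classified dataclass, the reason strings, the local RateLimitError stand-in, and matching on the exception's type name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class RateLimitError(Exception):
    """Stand-in for an SDK rate-limit exception that may lack .status_code."""


@dataclass
class Classified:
    # Hypothetical result type; the real classifier returns a richer object.
    status_code: Optional[int]
    reason: str


def classify_api_error(exc: Exception) -> Classified:
    status = getattr(exc, "status_code", None)
    # Some SDKs raise RateLimitError without a status code. Force 429 so the
    # downstream rate-limit handling still fires.
    if status is None and type(exc).__name__ == "RateLimitError":
        status = 429
    reason = "rate_limit" if status == 429 else "unknown"
    return Classified(status_code=status, reason=reason)
```

An exception that already carries a status code is passed through unchanged; only the missing-code case is rewritten.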

Credit

Validation

scripts/run_tests.sh tests/agent/test_error_classifier.py \
  tests/run_agent/test_primary_runtime_restore.py \
  tests/run_agent/test_switch_model_fallback_prune.py \
  tests/run_agent/test_provider_fallback.py \
  tests/gateway/test_auth_fallback.py \
  tests/run_agent/test_run_agent_codex_responses.py \
  tests/run_agent/test_jsondecodeerror_retryable.py

226/226 passing. run_agent.py, cli.py, agent/error_classifier.py, gateway/run.py, cron/scheduler.py all compile.

Conflict resolutions

#14445 (fix(agent): exclude ssl.SSLError from is_local_validation_error to prevent non-retryable abort), SSL exclusion: combined with main's (UnicodeEncodeError, json.JSONDecodeError) tuple from #15107 (fix(agent): retry on json.JSONDecodeError instead of treating it as a local validation error).
#14867 (fix: default missing fallback state in switch_model), fallback state: auto-merged against main's in-place swap semantics.
#8023 (fix(agent): force 429 classification for RateLimitError + rate-limit cooldown on primary restore), rate-limit cooldown: re-added the reason parameter to the _try_activate_fallback() signature, wired reason=classified.reason at the primary rate-limit caller site (line ~10633), and added the cooldown check in _restore_primary_runtime(). Force-429 classification was added directly to classify_api_error().
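The cooldown behaviour can be sketched like this. It is a simplified model, not the agent's code: the FallbackState class, the on_primary flag, the injectable clock, and the "rate_limit" reason string are assumptions made for illustration. Only the 60-second window, time.monotonic(), and the _rate_limited_until name come from the description above.

```python
import time

PRIMARY_COOLDOWN_SECONDS = 60  # window stated in the PR description


class FallbackState:
    """Hypothetical stand-in for the relevant slice of agent state."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the behaviour can be tested
        self.on_primary = True
        self._rate_limited_until = 0.0

    def try_activate_fallback(self, reason=None):
        # Start the cooldown only when leaving the primary provider,
        # not when advancing from one fallback to the next.
        if self.on_primary and reason == "rate_limit":
            self._rate_limited_until = self._clock() + PRIMARY_COOLDOWN_SECONDS
        self.on_primary = False

    def restore_primary_runtime(self) -> bool:
        # Stay on the working fallback until the cooldown has expired.
        if self._clock() < self._rate_limited_until:
            return False
        self.on_primary = True
        return True
```

Using a monotonic clock means the cooldown is unaffected by wall-clock adjustments during a long-running session.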

Not included

#14731 (fix(switch_model): guard _fallback_chain access on partially-constructed agents) by @AndreKurait: superseded by the more thorough defensive initialization in #14867 (fix: default missing fallback state in switch_model). Will close with credit.
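The defensive initialization from #14867 can be illustrated with a toy class. This is a sketch under assumptions: the Agent class and the pruning rule (drop entries whose provider matches the new one) are hypothetical, while the getattr(self, "_fallback_chain", []) or [] pattern and the unconditional assignment are taken from the description above.

```python
class Agent:
    """Toy stand-in for AIAgent; only the fallback-chain handling is modelled."""

    def switch_model(self, new_provider: str) -> None:
        # Works even if __init__ never ran or the attribute is None.
        chain = getattr(self, "_fallback_chain", []) or []
        # Illustrative pruning rule: drop entries for the newly selected provider.
        pruned = [e for e in chain if e.get("provider") != new_provider]
        # Assign back unconditionally so later reads never hit a missing attribute.
        self._fallback_chain = pruned
        self.provider = new_provider


# A partially-constructed agent, as created in tests via __new__.
agent = Agent.__new__(Agent)
agent.switch_model("minimax")
```

After the call, agent._fallback_chain exists and is an empty list, so the next access cannot raise AttributeError.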

…event non-retryable abort

ssl.SSLCertVerificationError, a subclass of ssl.SSLError, inherits from both OSError (through SSLError) and ValueError. The is_local_validation_error check used isinstance(api_error, (ValueError, TypeError)) to detect programming bugs that should abort immediately, but this inadvertently caught TLS certificate failures, treating a transport failure as a non-retryable client error.

The error classifier already maps SSLCertVerificationError to FailoverReason.timeout with retryable=True (its type name is in _TRANSPORT_ERROR_TYPES), but the inline isinstance guard was overriding that classification and triggering an unnecessary abort.

Fix: add ssl.SSLError to the exclusion list alongside the existing UnicodeEncodeError carve-out so TLS errors fall through to the classifier's retryable path.

Closes #14367
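A minimal sketch of the guard with the exclusion in place. In the PR the check is an inline expression in run_agent.py; wrapping it in a function and the name _RETRYABLE_EXCEPTIONS are assumptions made here for illustration.

```python
import json
import ssl

# Exceptions that are (or may be) ValueError subclasses but signal transport
# or payload problems rather than local programming bugs.
_RETRYABLE_EXCEPTIONS = (ssl.SSLError, UnicodeEncodeError, json.JSONDecodeError)


def is_local_validation_error(api_error: BaseException) -> bool:
    # Excluded types fall through to the retryable classification path.
    if isinstance(api_error, _RETRYABLE_EXCEPTIONS):
        return False
    return isinstance(api_error, (ValueError, TypeError))


# ssl.SSLCertVerificationError has ValueError in its MRO, which is what
# tripped the unguarded isinstance check.
cert_error = ssl.SSLCertVerificationError("certificate verify failed")
caught_by_naive_check = isinstance(cert_error, (ValueError, TypeError))
```

Listing ssl.SSLError rather than only SSLCertVerificationError covers every TLS exception subclass with one entry.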

The fallback activation path (_try_activate_fallback(), "Try to activate fallback model after errors") was calling get_model_context_length() without the config_context_length parameter, causing it to fall through to DEFAULT_FALLBACK_CONTEXT (128K) even when config.yaml has an explicit model.context_length value (e.g. 204800 for MiniMax-M2.7). This mirrors the fix already present in switch_model() at line 1988, which correctly passes config_context_length; the fallback path was missed.

Fixes: context_length forced to 128K on fallback activation
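The effect of the dropped argument can be shown with a simplified stand-in. The function below is not the real get_model_context_length(), which per the description also takes base_url, api_key, and provider; the reduced signature and the exact numeric value of DEFAULT_FALLBACK_CONTEXT are assumptions.

```python
from typing import Optional

# "128K" in the PR description; the exact integer is an assumption.
DEFAULT_FALLBACK_CONTEXT = 128_000


def get_model_context_length(
    model: str, config_context_length: Optional[int] = None
) -> int:
    # An explicit config.yaml override wins over any default.
    if config_context_length:
        return config_context_length
    return DEFAULT_FALLBACK_CONTEXT


# Before the fix: the override is dropped, so the default is used.
before = get_model_context_length("MiniMax-M2.7")
# After the fix: the override is forwarded and respected.
after = get_model_context_length("MiniMax-M2.7", config_context_length=204800)
```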

… fails

When the primary provider raises AuthError (expired OAuth token, revoked API key), the error was re-raised before AIAgent was created, so fallback_model was never consulted. Now both gateway/run.py and cron/scheduler.py catch AuthError specifically and attempt to resolve credentials from the fallback_providers/fallback_model config chain before propagating the error.

Closes #7230
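The chain walk can be sketched as below. This is an illustration under assumptions: the helper name resolve_with_fallback, its signature, and the local AuthError class are hypothetical, and the real code calls resolve_runtime_provider() with the project's own configuration objects.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class AuthError(Exception):
    """Stand-in for the project's authentication error type."""


def resolve_with_fallback(
    primary: str,
    fallback_providers: Iterable[str],
    resolve: Callable[[str], T],
) -> T:
    try:
        return resolve(primary)
    except AuthError as primary_error:
        # Walk the chain and use the first provider whose credentials resolve.
        for candidate in fallback_providers:
            try:
                return resolve(candidate)
            except AuthError:
                continue
        # Nothing worked: surface the original failure to the caller.
        raise primary_error
```

Only AuthError triggers the walk; any other exception from the primary still propagates immediately, which keeps unrelated failures visible.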

Bartok9 and others added 6 commits April 24, 2026 05:35

fix(agent): default missing fallback chain on switch

e999eac

fix(agent): only set rate-limit cooldown when leaving primary; add tests

8d28004

chore(release): map Group F contributors in AUTHOR_MAP

2445583

teknium1 merged commit fe9d9a2 into main Apr 24, 2026
10 of 11 checks passed

teknium1 deleted the hermes/hermes-172af8ae branch April 24, 2026 12:35

This was referenced Apr 24, 2026

fix(switch_model): guard _fallback_chain access on partially-constructed agents #14731

Closed

MiniMax switch_model credential guard test crashes on missing _fallback_chain #14864

Closed

Provider cooldown / circuit breaker for persistent failures #5436

Closed

alt-glitch added the type/bug, P1, comp/agent, and comp/gateway labels Apr 24, 2026

This was referenced Apr 24, 2026

fix(agent): guard switch_model when fallback chain is unset #15193

Closed

_restore_primary_runtime() doesn't check credential cooldown — burns retries every turn while provider is exhausted #15298

Closed

This was referenced Apr 27, 2026

fix(agent): exclude ssl.SSLError from is_local_validation_error to prevent non-retryable abort #14445

Closed

fix(gateway,cron): activate fallback_model when primary provider auth fails #7432

Closed

alt-glitch mentioned this pull request May 1, 2026

feat: provider cooldown / circuit breaker #5442

Open

teknium1 merged 6 commits into main from hermes/hermes-172af8ae

6 participants
