fix(agent): consolidated fallback/retry correctness - SSL retry, cooldown, config context, auth fallback parity #15134
Merged
…event non-retryable abort
ssl.SSLCertVerificationError, a subclass of ssl.SSLError, inherits from both OSError and ValueError. The is_local_validation_error check used isinstance(api_error, (ValueError, TypeError)) to detect programming bugs that should abort immediately, but this also matched certificate verification errors, treating a TLS transport failure as a non-retryable client error. The error classifier already maps SSLCertVerificationError to FailoverReason.timeout with retryable=True (its type name is in _TRANSPORT_ERROR_TYPES), but the inline isinstance guard overrode that classification and triggered an unnecessary abort.
Fix: add ssl.SSLError to the exclusion list alongside the existing UnicodeEncodeError carve-out so TLS errors fall through to the classifier's retryable path.
Closes #14367
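The inheritance issue in this commit can be reproduced with the standard library alone. The sketch below is illustrative, not the code in run_agent.py: `naive_check`, `fixed_check`, and `RETRYABLE_EXCLUSIONS` are hypothetical names, and the real exclusion tuple may contain more types.

```python
import ssl

# Standard-library facts: SSLCertVerificationError derives from both
# ssl.SSLError (an OSError) and ValueError.
cert_is_ssl_error = issubclass(ssl.SSLCertVerificationError, ssl.SSLError)
cert_is_value_error = issubclass(ssl.SSLCertVerificationError, ValueError)

# Hypothetical name; stands in for the exclusion list described above.
RETRYABLE_EXCLUSIONS = (UnicodeEncodeError, ssl.SSLError)


def naive_check(api_error: BaseException) -> bool:
    """Pre-fix behaviour: any ValueError/TypeError counts as a local bug."""
    return isinstance(api_error, (ValueError, TypeError))


def fixed_check(api_error: BaseException) -> bool:
    """Post-fix behaviour: excluded types fall through to the classifier."""
    return naive_check(api_error) and not isinstance(api_error, RETRYABLE_EXCLUSIONS)


cert_error = ssl.SSLCertVerificationError("certificate verify failed")
plain_bug = ValueError("bad argument")

naive_result = naive_check(cert_error)   # TLS failure mistaken for a local bug
fixed_result = fixed_check(cert_error)   # TLS failure left to the classifier
bug_result = fixed_check(plain_bug)      # genuine local bugs still abort
```

Excluding the base class ssl.SSLError rather than only the certificate subclass keeps the guard correct for any other SSL exception that might gain a ValueError base.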
_try_activate_fallback() was calling get_model_context_length() without the config_context_length parameter, causing it to fall through to DEFAULT_FALLBACK_CONTEXT (128K) even when config.yaml has an explicit model.context_length value (e.g. 204800 for MiniMax-M2.7). This mirrors the fix already present in switch_model() at line 1988, which correctly passes config_context_length. The fallback path was missed.
Fixes: context_length forced to 128K on fallback activation
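A minimal sketch of why the dropped argument matters. The function below is a hypothetical stand-in: the real get_model_context_length() takes more parameters (base_url, api_key, provider) and its lookup order is not shown in this PR. The numeric value of DEFAULT_FALLBACK_CONTEXT is also assumed from the "128K" wording.

```python
from typing import Optional

# Assumed value; the PR only says "128K".
DEFAULT_FALLBACK_CONTEXT = 128_000


def get_model_context_length(model: str,
                             config_context_length: Optional[int] = None) -> int:
    """Hypothetical stand-in: an explicit config override wins over the default."""
    if config_context_length is not None and config_context_length > 0:
        return config_context_length
    # Provider lookups omitted; unknown models end up on the default.
    return DEFAULT_FALLBACK_CONTEXT


# Shape of the relevant config.yaml entry, as a plain dict.
config = {"model": {"context_length": 204800}}
configured = config["model"].get("context_length")

# Before the fix: the fallback path omitted the override.
before = get_model_context_length("MiniMax-M2.7")

# After the fix: the override is threaded through, as switch_model() already did.
after = get_model_context_length("MiniMax-M2.7", config_context_length=configured)
```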
… fails
When the primary provider raises AuthError (expired OAuth token, revoked API key), the error was re-raised before AIAgent was created, so fallback_model was never consulted. Now both gateway/run.py and cron/scheduler.py catch AuthError specifically and attempt to resolve credentials from the fallback_providers/fallback_model config chain before propagating the error.
Closes #7230
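The startup behaviour described in this commit can be sketched as a walk over the fallback chain. Everything here is illustrative: `AuthError` is a local stand-in for the project's exception, and `resolve_with_fallback` and `fake_resolve` are hypothetical helpers, not functions from gateway/run.py or cron/scheduler.py.

```python
from typing import Callable, Dict, Sequence


class AuthError(Exception):
    """Local stand-in for the project's AuthError."""


def resolve_with_fallback(primary: str,
                          fallbacks: Sequence[str],
                          resolve: Callable[[str], Dict[str, str]]) -> Dict[str, str]:
    """Try the primary provider, then each fallback in order.

    Re-raises the primary's AuthError if no provider can authenticate,
    so the caller still sees the original failure.
    """
    try:
        return resolve(primary)
    except AuthError as primary_error:
        for name in fallbacks:
            try:
                return resolve(name)
            except AuthError:
                continue  # this fallback is also unusable; try the next one
        raise primary_error


def fake_resolve(name: str) -> Dict[str, str]:
    """Hypothetical resolver: two providers have expired credentials."""
    if name in {"primary", "backup-a"}:
        raise AuthError(f"{name}: token expired")
    return {"provider": name}


chosen = resolve_with_fallback("primary", ["backup-a", "backup-b"], fake_resolve)
```

Catching only AuthError keeps other startup failures (bad config, network errors) on their existing code paths.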
Summary
Consolidated salvage of 5 retry / fallback-chain correctness PRs touching the agent loop. Attribution preserved via rebase-merge. All changes preserve prompt-cache integrity (no context mutation, no mid-session reloads).
Changes
- `run_agent.py` (SSL retry): `ssl.SSLCertVerificationError`, a subclass of `ssl.SSLError`, also inherits from `ValueError`, so the `is_local_validation_error = isinstance(api_error, (ValueError, TypeError))` check misclassified TLS transport failures as a "local programming bug" and aborted without retrying. Added `ssl.SSLError` to the exclusion tuple alongside `UnicodeEncodeError` and `json.JSONDecodeError` (fix(agent): retry on json.JSONDecodeError instead of treating it as a local validation error #15107). From @Bartok9 (fix(agent): exclude ssl.SSLError from is_local_validation_error to prevent non-retryable abort #14445).
- `run_agent.py` (fallback config context): `_try_activate_fallback()` passed `base_url` / `api_key` / `provider` into `get_model_context_length()` but dropped `config_context_length`, so explicit `model.context_length` overrides in `config.yaml` were ignored once the fallback activated (falling back to 128K even when config said 204800). From @CruxExperts (fix(agent): pass config_context_length in fallback activation path #14727).
- `run_agent.py` (switch_model fallback state guard): `switch_model()` wrote `self._fallback_chain = [pruned]` only inside the provider-mismatch `if` branch. When called on a partially constructed agent (via `AIAgent.__new__` in tests) or when `old_norm == new_norm`, it left `_fallback_chain` unset and crashed on the next access. It now always initializes via `getattr(self, "_fallback_chain", []) or []` and assigns back unconditionally. From @LeonSGP43 (fix: default missing fallback state in switch_model #14867).
- `agent/error_classifier.py` + `run_agent.py` (429 + cooldown): two-part fix. From @vlwkaos (fix(agent): force 429 classification for RateLimitError + rate-limit cooldown on primary restore #8023).
  - `classify_api_error()` force-sets `status_code=429` when the error type is `RateLimitError` but `.status_code` is missing (Copilot / GitHub Models SDK quirk), so downstream rate-limit handling fires.
  - `_try_activate_fallback(reason=...)` sets `self._rate_limited_until = time.monotonic() + 60` only when leaving the primary provider (not when chain-advancing between fallbacks). `_restore_primary_runtime()` now respects the cooldown and stays on the working fallback until it expires, preventing the flip-flop retry loop that burned the budget on exhausted primaries.
- `gateway/run.py` + `cron/scheduler.py` (auth fallback parity): mirrors the CLI fallback-on-AuthError path (fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX #15104 / @A-FdL-Prog's fix(codex): route auth failures to fallback provider chain #5948) for the gateway and cron contexts. When `resolve_runtime_provider()` raises `AuthError` at session/job startup, walk the `fallback_providers` chain and switch to the first working provider instead of failing the message/job. From @Tranquil-Flow (fix(gateway,cron): activate fallback_model when primary provider auth fails #7432).

Credit
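For the 429 + cooldown item in the Changes list, a minimal sketch of both parts follows. It is an assumption-laden illustration, not the project's code: `effective_status_code`, `FallbackState`, and the injectable `clock` are hypothetical, `RateLimitError` is a local stand-in, and gating the cooldown on a rate-limit reason is assumed from the description.

```python
import time
from typing import Callable, Optional


class RateLimitError(Exception):
    """Local stand-in for an SDK error that may lack .status_code."""


def effective_status_code(error: BaseException) -> Optional[int]:
    """Part 1: treat a RateLimitError without a status code as HTTP 429."""
    status = getattr(error, "status_code", None)
    if status is None and type(error).__name__ == "RateLimitError":
        return 429
    return status


class FallbackState:
    """Part 2: keep off a rate-limited primary until the cooldown expires."""

    COOLDOWN_SECONDS = 60.0

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock              # injectable so tests need not sleep
        self._rate_limited_until = 0.0
        self.on_primary = True

    def activate_fallback(self, rate_limited: bool) -> None:
        # Start the cooldown only when leaving the primary, not when
        # advancing from one fallback to the next.
        if self.on_primary and rate_limited:
            self._rate_limited_until = self._clock() + self.COOLDOWN_SECONDS
        self.on_primary = False

    def restore_primary(self) -> bool:
        if self._clock() < self._rate_limited_until:
            return False                 # still cooling down; stay on fallback
        self.on_primary = True
        return True


now = [1000.0]                           # fake monotonic clock
state = FallbackState(clock=lambda: now[0])
status = effective_status_code(RateLimitError("slow down"))
state.activate_fallback(rate_limited=(status == 429))
restored_early = state.restore_primary()
now[0] += 61.0
restored_late = state.restore_primary()
```

A monotonic clock is used because wall-clock adjustments must not shorten or extend the cooldown.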
Validation
226/226 passing.
`run_agent.py`, `cli.py`, `agent/error_classifier.py`, `gateway/run.py`, `cron/scheduler.py` all compile.

Conflict resolutions
- `(UnicodeEncodeError, json.JSONDecodeError)` tuple from fix(agent): retry on json.JSONDecodeError instead of treating it as a local validation error #15107.
- `reason` parameter to `_try_activate_fallback()` signature, wired `reason=classified.reason` at the primary rate-limit caller site (line ~10633), added cooldown check in `_restore_primary_runtime()`. Force-429 classification added directly to `classify_api_error()`.

Not included