fix(agent): consolidated fallback/retry correctness — SSL retry, cooldown, config context, auth fallback parity#15134

Merged: teknium1 merged 6 commits into main from hermes/hermes-172af8ae on Apr 24, 2026
@teknium1

Summary

Consolidated salvage of 5 retry / fallback-chain correctness PRs touching the agent loop. Attribution preserved via rebase-merge. All changes preserve prompt-cache integrity (no context mutation, no mid-session reloads).

Changes

Credit

Validation

scripts/run_tests.sh tests/agent/test_error_classifier.py \
  tests/run_agent/test_primary_runtime_restore.py \
  tests/run_agent/test_switch_model_fallback_prune.py \
  tests/run_agent/test_provider_fallback.py \
  tests/gateway/test_auth_fallback.py \
  tests/run_agent/test_run_agent_codex_responses.py \
  tests/run_agent/test_jsondecodeerror_retryable.py

226/226 passing. run_agent.py, cli.py, agent/error_classifier.py, gateway/run.py, cron/scheduler.py all compile.

Conflict resolutions

Not included

Bartok9 and others added 6 commits April 24, 2026 05:35
…event non-retryable abort

ssl.SSLCertVerificationError inherits from both ssl.SSLError (an OSError
subclass) and ValueError, so ValueError appears in its MRO. The
is_local_validation_error check used
isinstance(api_error, (ValueError, TypeError)) to detect programming bugs
that should abort immediately, but this inadvertently caught
ssl.SSLCertVerificationError as well, treating a TLS transport failure as
a non-retryable client error.

The error classifier already maps SSLCertVerificationError to
FailoverReason.timeout with retryable=True (its type name is in
_TRANSPORT_ERROR_TYPES), but the inline isinstance guard was overriding
that classification and triggering an unnecessary abort.

Fix: add ssl.SSLError to the exclusion list alongside the existing
UnicodeEncodeError carve-out so TLS errors fall through to the
classifier's retryable path.

Closes #14367
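
The inheritance quirk behind this commit can be checked directly in a Python shell. The sketch below is illustrative only: `is_local_validation_error` is written here as a standalone function with an assumed signature, whereas the commit message describes it as an inline check on `api_error`, and the real guard may cover more cases.

```python
import ssl


def is_local_validation_error(exc: BaseException) -> bool:
    """Hypothetical stand-in for the inline guard.

    True means "treat as a programming bug and abort without retrying".
    """
    # Carve-outs: these are ValueError subclasses but not programming bugs,
    # so they must fall through to the error classifier instead.
    if isinstance(exc, (UnicodeEncodeError, ssl.SSLError)):
        return False
    return isinstance(exc, (ValueError, TypeError))


cert_error = ssl.SSLCertVerificationError("certificate verify failed")

# The certificate error is both an OSError and a ValueError.
print(isinstance(cert_error, OSError))     # True
print(isinstance(cert_error, ValueError))  # True

# With the carve-out in place it is no longer treated as a local bug,
# while a plain ValueError still is.
print(is_local_validation_error(cert_error))                  # False
print(is_local_validation_error(ValueError("bad argument")))  # True
```

Excluding the base class `ssl.SSLError` rather than only `ssl.SSLCertVerificationError` keeps every TLS failure on the same path, regardless of which subclass is raised.
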
The path that tries to activate the fallback model after errors was
calling get_model_context_length() without the config_context_length
parameter, so the lookup fell through to DEFAULT_FALLBACK_CONTEXT (128K)
even when config.yaml sets an explicit model.context_length value
(e.g. 204800 for MiniMax-M2.7).

This mirrors the fix already present in switch_model() at line 1988, which
correctly passes config_context_length. The fallback path was missed.

Fixes: context_length forced to 128K on fallback activation
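
To make the effect of the missing argument concrete, here is a minimal sketch. It assumes a heavily simplified `get_model_context_length` in which an explicit config value wins and everything else falls back to the default. The real function presumably has additional lookup steps and a different signature, and the exact value of `DEFAULT_FALLBACK_CONTEXT` is an assumption standing in for the "128K" mentioned above.

```python
from typing import Optional

# Assumed value for illustration; the commit message only says "128K".
DEFAULT_FALLBACK_CONTEXT = 128_000


def get_model_context_length(
    model: str, config_context_length: Optional[int] = None
) -> int:
    """Hypothetical simplified resolver: explicit config wins, else the default."""
    if config_context_length is not None and config_context_length > 0:
        return config_context_length
    return DEFAULT_FALLBACK_CONTEXT


# Before the fix: the fallback path omitted the config value.
before = get_model_context_length("MiniMax-M2.7")

# After the fix: the value from config.yaml (model.context_length) is passed through.
after = get_model_context_length("MiniMax-M2.7", config_context_length=204800)

print(before, after)
```
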
… fails

When the primary provider raises AuthError (expired OAuth token,
revoked API key), the error was re-raised before AIAgent was created,
so fallback_model was never consulted. Now both gateway/run.py and
cron/scheduler.py catch AuthError specifically and attempt to resolve
credentials from the fallback_providers/fallback_model config chain
before propagating the error.

Closes #7230
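
The control flow described in this commit can be sketched roughly as follows. Everything here is hypothetical: `resolve_with_fallback`, the `resolve` callable, and the local `AuthError` class are illustrative names, not the actual code in gateway/run.py or cron/scheduler.py. The sketch only shows the ordering: try the primary provider, walk the fallback chain on an auth failure, and propagate the original error if nothing resolves.

```python
from typing import Callable, Iterable, Tuple


class AuthError(Exception):
    """Stand-in for the project's auth failure exception."""


def resolve_with_fallback(
    primary: str,
    fallbacks: Iterable[str],
    resolve: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (provider, credentials), consulting fallbacks only on AuthError."""
    try:
        return primary, resolve(primary)
    except AuthError as primary_error:
        for provider in fallbacks:
            try:
                return provider, resolve(provider)
            except AuthError:
                continue  # this fallback is also unusable; try the next one
        # No fallback worked: propagate the original failure.
        raise primary_error


def fake_resolve(provider: str) -> str:
    """Toy resolver: only 'backup' has valid credentials."""
    if provider == "backup":
        return "backup-token"
    raise AuthError(f"no valid credentials for {provider}")


provider, credentials = resolve_with_fallback(
    "primary", ["expired", "backup"], fake_resolve
)
print(provider, credentials)  # backup backup-token
```

Catching only `AuthError` keeps other failures (network errors, misconfiguration) on their existing paths, which matches the commit's wording that the error is caught "specifically".
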
Labels

comp/agent Core agent loop, run_agent.py, prompt builder
comp/gateway Gateway runner, session dispatch, delivery
P1 High — major feature broken, no workaround
type/bug Something isn't working
