fix(agents): continue fallback loop for unrecognized provider errors#26106

Merged
steipete merged 2 commits into openclaw:main from Sid-Qin:fix/model-fallback-exhaustion-25926
Feb 25, 2026
@Sid-Qin Sid-Qin commented Feb 25, 2026

Summary

  • Problem: Model fallback stops after 2 models when a provider returns an error that coerceToFailoverError cannot classify, even though 17 fallback models are configured
  • Why it matters: Users experience 30–60 min downtime waiting for cooldown to expire, even when other providers are healthy
  • What changed: In runWithModelFallback (src/agents/model-fallback.ts), unrecognized errors now continue the fallback loop instead of immediately rethrowing; rethrow only occurs on the last candidate
  • What did NOT change: Auth errors, rate-limit cooldown, and context-overflow errors behave exactly as before
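
The loop behavior described above can be sketched roughly as follows. This is a simplified, synchronous illustration and not the code in src/agents/model-fallback.ts: `runWithFallbackSketch`, `classifySketch`, and the `Attempt` shape are hypothetical names, the classifier is a stand-in for coerceToFailoverError, and the special handling of context-overflow and abort errors is omitted.

```typescript
// Hypothetical, simplified sketch of the fallback loop described in this PR.
type Attempt = { model: string; error: string };

// Stand-in for coerceToFailoverError (assumption): returns a reason for
// errors it can classify, or null for unrecognized ones.
function classifySketch(err: unknown): "auth" | "rate_limit" | null {
  const msg = err instanceof Error ? err.message : String(err);
  if (/401|unauthorized/i.test(msg)) return "auth";
  if (/429|rate limit/i.test(msg)) return "rate_limit";
  return null;
}

function runWithFallbackSketch<T>(
  candidates: string[],
  run: (model: string) => T,
): { result: T; model: string; attempts: Attempt[] } {
  const attempts: Attempt[] = [];
  for (let i = 0; i < candidates.length; i++) {
    const model = candidates[i];
    try {
      return { result: run(model), model, attempts };
    } catch (err) {
      const isLast = i === candidates.length - 1;
      // Before the fix (as described above), an unrecognized error was
      // rethrown here regardless of position. Now it is rethrown only
      // when no candidates remain.
      if (classifySketch(err) === null && isLast) throw err;
      // Otherwise record the failure and move on to the next candidate.
      attempts.push({
        model,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  const summary = attempts.map((a) => `${a.model}: ${a.error}`).join(" | ");
  throw new Error(`All models failed (${attempts.length}): ${summary}`);
}
```

With three candidates where the first two throw an unrecognized error, this sketch returns the third candidate's result and records two attempts; with an unrecognized error on the final candidate, the original error propagates unchanged.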

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

When a provider returns an unrecognized error (not auth, rate-limit, or context-overflow), the system now continues trying remaining fallback models instead of aborting. Users will see fewer All models failed (2) errors when many fallbacks are configured.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • Model/provider: Multiple providers (e.g., qwen-portal, opencode, nvidia-nim, xfyun)
  • Relevant config: 17 fallback models across 4 providers

Steps

  1. Configure multiple fallback models across providers
  2. Wait for the first two providers to hit rate-limit cooldown
  3. Send a message to trigger model fallback

Expected

  • System skips cooled-down providers and continues to remaining fallback models
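
As a rough illustration of the expected behavior, the sketch below separates candidates whose provider is still cooling down from those that can be tried. Everything here is an assumption made for the example: the `provider/model` reference format, the `cooldownUntil` record, the function names, and the placeholder model names are hypothetical and are not taken from the repository.

```typescript
// Hypothetical sketch: split candidates into ready vs. cooled-down.
// Assumes model references look like "provider/model".
function providerOfSketch(modelRef: string): string {
  return modelRef.split("/")[0];
}

function partitionByCooldownSketch(
  candidates: string[],
  cooldownUntil: Record<string, number>, // provider -> epoch ms (assumption)
  now: number,
): { ready: string[]; skipped: string[] } {
  const ready: string[] = [];
  const skipped: string[] = [];
  for (const ref of candidates) {
    // A provider with no entry is treated as not cooling down.
    const until = cooldownUntil[providerOfSketch(ref)] ?? 0;
    if (until > now) {
      skipped.push(ref);
    } else {
      ready.push(ref);
    }
  }
  return { ready, skipped };
}

// Usage: two providers are cooling down, the other two remain eligible.
const partitioned = partitionByCooldownSketch(
  ["qwen-portal/model-a", "opencode/model-b", "nvidia-nim/model-c", "xfyun/model-d"],
  { "qwen-portal": 2000, opencode: 1500 },
  1000,
);
```

In this usage example the two cooled-down providers land in `skipped`, and the remaining two candidates stay in `ready`, in their configured order.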

Actual (before fix)

  • Error after 2 models: All models failed (2): ... Provider X is in cooldown | Provider Y is in cooldown

Evidence

  • Failing test/log before + passing after
  • Updated test: unrecognized errors with remaining candidates now trigger fallback to next model
  • New test: unrecognized error on last candidate is correctly rethrown
  • All model-fallback tests pass

Human Verification (required)

  • Verified scenarios: unrecognized error mid-chain continues fallback; unrecognized error on last candidate throws; auth/rate-limit errors behave as before
  • Edge cases checked: single candidate with fallbacksOverride: [], error on last candidate
  • What you did not verify: live multi-provider setup with actual rate-limiting

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert commit 265364e
  • Files/config to restore: src/agents/model-fallback.ts
  • Known bad symptoms: Non-retryable errors being retried across all candidates instead of failing fast

Risks and Mitigations

  • Risk: Some errors that were previously non-retryable might now be retried across all candidates, adding latency
    • Mitigation: Context-overflow and abort errors are still handled as non-retryable; only truly unclassified errors continue the loop
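
The mitigation can be pictured as a small decision helper. This is a sketch under stated assumptions only: `dispositionSketch` and both predicates are hypothetical, the context-overflow message pattern is invented for illustration, and the `AbortError` name check is the usual convention for aborted operations rather than something confirmed from this codebase.

```typescript
// Hypothetical sketch of the "fail fast vs. keep falling back" decision.
type Disposition = "stop" | "continue";

function isAbortErrorSketch(err: unknown): boolean {
  // Aborted operations conventionally surface as errors named "AbortError".
  return err instanceof Error && err.name === "AbortError";
}

function isContextOverflowSketch(err: unknown): boolean {
  // Illustrative message pattern (assumption), not the project's detection.
  return (
    err instanceof Error &&
    /context (length|window)|too many tokens/i.test(err.message)
  );
}

function dispositionSketch(err: unknown, isLastCandidate: boolean): Disposition {
  // Non-retryable categories stop the loop regardless of position.
  if (isAbortErrorSketch(err) || isContextOverflowSketch(err)) return "stop";
  // Anything else continues while candidates remain.
  return isLastCandidate ? "stop" : "continue";
}
```

Under these assumptions, only errors that match neither non-retryable category incur the extra latency of trying the remaining candidates.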

SidQin-cyber and others added 2 commits February 25, 2026 04:52
When a provider returns an error that coerceToFailoverError cannot
classify (e.g., custom error messages without standard HTTP status
codes), the fallback loop threw immediately instead of trying the
next candidate. This caused fallback to stop after 2 models even
when 17 were configured.

Only rethrow unrecognized errors when they occur on the last
candidate. For intermediate candidates, record the error as an
attempt and continue to the next model.

Closes openclaw#25926

Co-authored-by: Cursor <cursoragent@cursor.com>
@steipete steipete force-pushed the fix/model-fallback-exhaustion-25926 branch from 265364e to f78bf75 February 25, 2026 04:53
@steipete steipete merged commit 156f13a into openclaw:main Feb 25, 2026
9 checks passed
@steipete
Landed via temp rebase onto main.

  • Gate: pnpm test src/agents/model-fallback.test.ts && pnpm check
  • Land commit: f78bf75
  • Merge commit: 156f13a

Thanks @Sid-Qin!

steipete added a commit to justinhuangcode/openclaw that referenced this pull request Feb 25, 2026
…penclaw#26106)

* fix(agents): continue fallback loop for unrecognized provider errors

When a provider returns an error that coerceToFailoverError cannot
classify (e.g., custom error messages without standard HTTP status
codes), the fallback loop threw immediately instead of trying the
next candidate. This caused fallback to stop after 2 models even
when 17 were configured.

Only rethrow unrecognized errors when they occur on the last
candidate. For intermediate candidates, record the error as an
attempt and continue to the next model.

Closes openclaw#25926

Co-authored-by: Cursor <cursoragent@cursor.com>

* test: cover unknown-error fallback telemetry and land openclaw#26106 (thanks @Sid-Qin)

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Jackson3195 pushed a commit to Jackson3195/openclaw-with-a-personal-touch that referenced this pull request Feb 25, 2026
brianleach pushed a commit to brianleach/openclaw that referenced this pull request Feb 26, 2026
execute008 pushed a commit to execute008/openclaw that referenced this pull request Feb 27, 2026
r4jiv007 pushed a commit to r4jiv007/openclaw that referenced this pull request Feb 28, 2026
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
thebenjaminlee pushed a commit to escape-velocity-ventures/openclaw that referenced this pull request Mar 7, 2026
Labels

agents (Agent runtime and tooling), size: S

Development

Successfully merging this pull request may close these issues.

[Bug]: Model fallback stops after 2 models instead of trying all configured fallbacks

2 participants