
fix(failover): treat HTTP 5xx as rate-limit for model fallback #21049

Closed
maximalmargin wants to merge 1 commit into openclaw:main from maximalmargin:fix/http-5xx-failover

Conversation

@maximalmargin maximalmargin commented Feb 19, 2026

Summary

Treat HTTP 502/503/504 as failover-eligible (rate_limit reason) so configured model fallbacks trigger when the primary provider is overloaded or temporarily unavailable.

Changes

  • Added handling for status codes 502, 503, 504 in resolveFailoverReasonFromError()
  • Treats these as rate_limit failures to enable existing fallback/cooldown behavior
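The classification described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `sketchResolveFailoverReason`, `getStatusCode`, and the `status`/`statusCode` error fields are hypothetical names, and the 429 branch is an assumption about how rate limits are already classified.

```typescript
// Hypothetical sketch; the real resolveFailoverReasonFromError() may differ.
type FailoverReason = "rate_limit";

// Status codes this PR proposes to treat as failover-eligible.
const GATEWAY_STATUS_CODES = new Set<number>([502, 503, 504]);

// Hypothetical helper: reads a numeric HTTP status from an unknown error value.
function getStatusCode(err: unknown): number | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const candidate = err as { status?: unknown; statusCode?: unknown };
  const raw = candidate.status ?? candidate.statusCode;
  return typeof raw === "number" ? raw : undefined;
}

function sketchResolveFailoverReason(err: unknown): FailoverReason | null {
  const status = getStatusCode(err);
  if (status === undefined) return null;
  // Assumption: 429 (Too Many Requests) is already mapped to rate_limit.
  if (status === 429) return "rate_limit";
  // The proposed change: overloaded or unavailable upstreams reuse the same reason.
  if (GATEWAY_STATUS_CODES.has(status)) return "rate_limit";
  // Anything else is left to other classification paths.
  return null;
}
```

Reusing `rate_limit` keeps the change small because no new reason value or handling path is needed, at the cost of labeling server-side outages with a reason name that describes client throttling.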

Fixes

Closes #20999

Greptile Summary

Adds handling for HTTP 502 (Bad Gateway), 503 (Service Unavailable), and 504 (Gateway Timeout) status codes in resolveFailoverReasonFromError(), treating them as rate_limit failures to enable model fallback when the primary provider is overloaded or temporarily unavailable.

  • Maps 502/503/504 errors to rate_limit reason, which triggers the existing failover/cooldown behavior in runWithModelFallback()
  • Aligns with existing isTransientHttpError() logic in pi-embedded-helpers/errors.ts which treats 500, 502, 503 and Cloudflare 5xx codes as transient failures (mapped to timeout)
  • Simple, focused change that enables configured model fallbacks to activate on server-side failures
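To illustrate why the reason mapping matters, here is a rough sketch of a fallback loop that only moves to the next model when an error is classified as failover-eligible. It is not the actual `runWithModelFallback()` implementation: `sketchRunWithFallback`, `attempt`, and `isFailoverEligible` are hypothetical names, and cooldown tracking is omitted.

```typescript
// Hypothetical sketch of a model fallback loop; cooldown handling is left out.
type Attempt<T> = (model: string) => Promise<T>;

async function sketchRunWithFallback<T>(
  models: string[],
  attempt: Attempt<T>,
  isFailoverEligible: (err: unknown) => boolean,
): Promise<T> {
  let lastError: unknown = new Error("no models configured");
  for (const model of models) {
    try {
      // The first model that succeeds wins.
      return await attempt(model);
    } catch (err) {
      // Errors without a failover reason surface immediately.
      if (!isFailoverEligible(err)) throw err;
      // Eligible errors are remembered and the next model is tried.
      lastError = err;
    }
  }
  // Every configured model failed with an eligible error.
  throw lastError;
}
```

Under this sketch, a 503 from the primary provider reaches the fallback model only if the classifier returns a reason for it, which is the gap the PR targets.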

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change is a straightforward, logical extension to error classification that properly handles server-side errors. It maps 502/503/504 status codes to rate_limit reason, which is semantically appropriate for temporary unavailability. The code follows existing patterns, has clear comments, and integrates seamlessly with the existing failover infrastructure. The implementation is simple and doesn't introduce new dependencies or complex logic.
  • No files require special attention

Last reviewed commit: b09e1b0

Treat 502/503/504 as failover-eligible (rate_limit reason) so
configured model fallbacks trigger when the primary provider is
overloaded or temporarily unavailable.

Fixes openclaw#20999
@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: XS labels Feb 19, 2026
@maximalmargin maximalmargin marked this pull request as ready for review February 19, 2026 16:04
@vincentkoc (Member)

Nice catch on the failover gap.

I’m closing this as a duplicate of #21017 for #20999. We’re keeping #21017 because it includes explicit regression tests and aligns 502/503/504 handling with the existing timeout/transient classification path.

Your PR helped validate the root cause and urgency. If you want this reopened for an alternative reason-mapping approach, say the word and we can re-check quickly.

@vincentkoc vincentkoc closed this Feb 23, 2026
@vincentkoc vincentkoc added dedupe:child Duplicate issue/PR child in dedupe cluster close:duplicate Closed as duplicate labels Feb 23, 2026

Labels

agents Agent runtime and tooling close:duplicate Closed as duplicate dedupe:child Duplicate issue/PR child in dedupe cluster size: XS

Development

Successfully merging this pull request may close these issues.

Treat HTTP 503 (and 502/504) as failover-eligible so model fallback triggers

2 participants