
fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)#21017

Merged
vincentkoc merged 4 commits into openclaw:main from
taw0002:fix/failover-502-503-504
Feb 23, 2026

Conversation

@taw0002
Contributor

@taw0002 taw0002 commented Feb 19, 2026

Problem

When the primary model's API returns 502, 503, or 504, resolveFailoverReasonFromError() in failover-error.ts doesn't match any status-code branch (only 402/429/401/403/408/400 are handled). The error falls through to message-based classification via classifyFailoverReason(), which relies on extractLeadingHttpStatus() — this only works if the error message starts with the numeric status code (e.g. "503 Service Unavailable ...").

Many API SDKs (Google, Anthropic, OpenAI) set err.status = 503 as a property without prefixing the message string with 503, so the message-based classifier never matches and model failover never triggers. The run retries the same unavailable model indefinitely.
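The gap can be illustrated with a minimal sketch. The helper below is a hypothetical stand-in for extractLeadingHttpStatus(), not the project's actual implementation; it assumes the parser only recognises a three-digit code at the very start of the message. The `SdkLikeError` shape and both sample messages are likewise made up for illustration.

```typescript
// Hypothetical stand-in for extractLeadingHttpStatus(); the real helper may differ.
// Assumption: it only recognises a three-digit status at the start of the message.
function leadingHttpStatus(message: string): number | undefined {
  const match = /^\s*(\d{3})\b/.exec(message);
  return match ? Number(match[1]) : undefined;
}

// Illustrative error shape: status is a property, the message may lack a numeric prefix.
interface SdkLikeError {
  status?: number;
  message: string;
}

const prefixed: SdkLikeError = { message: "503 Service Unavailable" };
const propertyOnly: SdkLikeError = { status: 503, message: "The model is overloaded" };

console.log(leadingHttpStatus(prefixed.message));     // 503
console.log(leadingHttpStatus(propertyOnly.message)); // undefined
```

In the second case the status is only available on the error object, so a classifier that looks at the message alone has nothing to match.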

Fix

Add 502 || 503 || 504 to the status-code branch in resolveFailoverReasonFromError(), returning "timeout". This is consistent with isTransientHttpError() in the message-based classifier, which checks TRANSIENT_HTTP_ERROR_CODES = new Set([500, 502, 503, 521, 522, 523, 524, 529]). 504 is not in that set, but it is added here as well since it represents a gateway timeout.

Test assertions added for all three status codes.
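A rough sketch of what the added branch amounts to, under stated assumptions: `readStatus` and `reasonFromStatus` are hypothetical names, the `FailoverReason` type is cut down to the one member discussed here, and the existing branches for other statuses are omitted because their return values are not shown in this description.

```typescript
// Sketch only: the real resolveFailoverReasonFromError() handles more statuses and reasons.
type FailoverReason = "timeout"; // assumption: the project's union has further members

// Read a numeric `status` property from an unknown error value, if present.
function readStatus(err: unknown): number | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const status = (err as { status?: unknown }).status;
  return typeof status === "number" ? status : undefined;
}

function reasonFromStatus(err: unknown): FailoverReason | null {
  const status = readStatus(err);
  // Branches for 402/429/401/403/408/400 are omitted here.
  if (status === 502 || status === 503 || status === 504) {
    return "timeout";
  }
  return null; // caller would fall back to message-based classification
}

// The message deliberately lacks a numeric prefix, as in the SDK case described above.
for (const status of [502, 503, 504]) {
  const err = Object.assign(new Error("upstream unavailable"), { status });
  console.log(status, reasonFromStatus(err)); // each line ends with "timeout"
}
```

Checking the property first means the result no longer depends on how a given SDK happens to word its error message.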

Why "timeout" and not "rate_limit"?

The message-based classifier (classifyFailoverReason) already maps transient HTTP errors to "timeout" via isTransientHttpError() → return "timeout". Using the same reason ensures consistent behavior regardless of whether the error is classified by status code or message text.

Fixes #20999

Greptile Summary

Fixed model failover not triggering for HTTP 502/503/504 errors by adding explicit status-code branches in resolveFailoverReasonFromError(). These transient server errors now return "timeout" as the failover reason, consistent with the existing message-based classification for other transient errors.

  • Added status checks for 502, 503, and 504 in src/agents/failover-error.ts:164-166
  • Added test coverage for all three status codes
  • Fix prevents runs from retrying the same unavailable model indefinitely when API SDKs set err.status without prefixing the message string

Minor inconsistency: 504 is now treated as transient in the status-code branch but TRANSIENT_HTTP_ERROR_CODES (used by message-based classification) excludes it. This means classification could differ based on error format, though the practical impact is limited since most SDKs set err.status.
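The mismatch can be made concrete with a small sketch. The first set is copied from the PR description and reflects the reviewed commit; `STATUS_BRANCH_TIMEOUT_CODES` and `onlyInStatusPath` are hypothetical names introduced here for the comparison.

```typescript
// Values copied from the PR description; the set reflects the reviewed commit.
const TRANSIENT_HTTP_ERROR_CODES = new Set<number>([500, 502, 503, 521, 522, 523, 524, 529]);
// Codes this PR adds to the status-code branch (hypothetical constant name).
const STATUS_BRANCH_TIMEOUT_CODES = new Set<number>([502, 503, 504]);

// Codes the status-code path treats as "timeout" that the message path would not.
const onlyInStatusPath = Array.from(STATUS_BRANCH_TIMEOUT_CODES).filter(
  (code) => !TRANSIENT_HTTP_ERROR_CODES.has(code),
);

console.log(onlyInStatusPath); // [ 504 ]
```

The squash commit message on this PR later lists "Failover: classify HTTP 504 as transient in message parser", which appears to address this difference.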

Confidence Score: 4/5

  • This PR is safe to merge with low risk
  • The fix is narrowly scoped, well-tested, and addresses a clear bug where failover wasn't triggering for common transient HTTP errors. The logic is straightforward and consistent with existing patterns. Score is 4 (not 5) due to a minor inconsistency between status-code and message-based classification paths for 504 errors, though this is unlikely to cause issues in practice.
  • No files require special attention

Last reviewed commit: c932dac

@openclaw-barnacle openclaw-barnacle bot added the agents (Agent runtime and tooling) and size: XS labels Feb 19, 2026
Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 1 comment



@nikolasdehor nikolasdehor left a comment


This is a clean, well-scoped fix. I commented on #20999 about exactly this problem — my fallback chain is Codex → Opus, and when the primary returns 502/503/504 during a provider outage, the gateway should fail over to the next model instead of retrying the same unavailable one indefinitely.

The diff is minimal and correct:

  1. Status-code mapping is right. Adding 502 || 503 || 504 to resolveFailoverReasonFromError() closes the gap where err.status is set by the SDK but the error message doesn't start with the numeric code, so the message-based classifier never matches.

  2. "timeout" is the correct reason. My original comment on #20999 suggested "rate_limit", but after looking at the code more carefully, "timeout" is the right choice — the message-based classifier in classifyFailoverReason already maps transient HTTP errors to "timeout" via isTransientHttpError(), so this keeps both classification paths consistent.

  3. Tests cover all three codes. Simple and sufficient.

One minor note: Greptile flagged that 504 is now handled in the status-code path but is absent from TRANSIENT_HTTP_ERROR_CODES (used by message-based classification). Practically this doesn't matter — the status-code path takes priority when err.status is set, and if it isn't set, a 504 message with the right prefix would still need to be caught. But it might be worth a follow-up to add 504 to TRANSIENT_HTTP_ERROR_CODES for full consistency. Not a blocker.

Also worth noting: this PR covers HTTP status codes only, not connection-level errors (ECONNREFUSED, ETIMEDOUT, etc.). Those are a separate concern and shouldn't block this fix — getting 5xx failover working is the priority.

Tested mentally against my own setup: Codex primary returns 503 during outage → resolveFailoverReasonFromError now returns "timeout" → failover kicks in → Opus handles the request. Exactly what should happen.
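The flow described above could be sketched roughly as follows. This is not the project's runner: `ModelCandidate`, `runWithFailover`, and the `eligible` predicate are hypothetical names, the calls are synchronous for brevity, and rethrowing non-eligible errors immediately is an assumption about the intended behavior.

```typescript
// Hypothetical types and names; the project's runner is not shown in this PR.
interface ModelCandidate {
  name: string;
  run: () => string; // synchronous for brevity; real model calls would be async
}

function runWithFailover(
  chain: ModelCandidate[],
  isFailoverEligible: (err: unknown) => boolean,
): { model: string; output: string } {
  let lastError: unknown = new Error("model chain is empty");
  for (const candidate of chain) {
    try {
      return { model: candidate.name, output: candidate.run() };
    } catch (err) {
      // Assumption: errors that are not failover-eligible are surfaced immediately.
      if (!isFailoverEligible(err)) throw err;
      lastError = err;
    }
  }
  throw lastError; // every candidate failed with an eligible error
}

// Primary fails with err.status = 503 and no numeric prefix in the message.
const outage = Object.assign(new Error("Service Unavailable"), { status: 503 });
const eligible = (err: unknown): boolean => {
  const status = (err as { status?: unknown } | null)?.status;
  return status === 502 || status === 503 || status === 504;
};

const result = runWithFailover(
  [
    { name: "primary", run: () => { throw outage; } },
    { name: "fallback", run: () => "handled by fallback" },
  ],
  eligible,
);
console.log(result.model); // "fallback"
```

Passing the eligibility check in as a function keeps the loop independent of how errors are classified, so the same loop works whether the reason comes from the status code or from the message text.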

@vincentkoc vincentkoc added the dedupe:parent (Primary canonical item in dedupe cluster) label Feb 23, 2026
@vincentkoc
Member

vincentkoc commented Feb 23, 2026

This is now the canonical PR for #20999 after closing #21049.

One blocker before merge: check is failing on the merge ref due to formatting drift (src/agents/tools/cron-tool.ts) from the current base branch. Please rebase onto latest main and rerun CI so check is green on this PR head.

taw0002 and others added 2 commits February 23, 2026 00:43
When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes openclaw#20999
@vincentkoc vincentkoc force-pushed the fix/failover-502-503-504 branch from c932dac to 90858ab Compare February 23, 2026 05:48
@vincentkoc vincentkoc force-pushed the fix/failover-502-503-504 branch from 1986812 to d792ba4 Compare February 23, 2026 07:57
@vincentkoc vincentkoc merged commit 3c57bf4 into openclaw:main Feb 23, 2026
23 of 25 checks passed
obviyus pushed a commit to jd316/openclaw that referenced this pull request Feb 23, 2026
…enclaw#21017)

* fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

When a model API returns 502 Bad Gateway, 503 Service Unavailable, or
504 Gateway Timeout, the error object carries the status code directly.
resolveFailoverReasonFromError() only checked 402/429/401/403/408/400,
so 5xx server errors fell through to message-based classification which
requires the status code to appear at the start of the error message.

Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing
the message with '503', so the message classifier never matched and
failover never triggered — the run retried the same broken model.

Add 502/503/504 to the status-code branch, returning 'timeout' (matching
the existing behavior of isTransientHttpError in the message classifier).

Fixes openclaw#20999

* Changelog: add failover 502/503/504 note with credits

* Failover: classify HTTP 504 as transient in message parser

* Changelog: credit taw0002 and vincentkoc for failover fix

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
jaydiamond42 pushed a commit to jaydiamond42/bloomtbot that referenced this pull request Feb 23, 2026
carlosrivera pushed a commit to myascendai/meshiclaw that referenced this pull request Feb 23, 2026
gabrielkoo pushed a commit to gabrielkoo/openclaw that referenced this pull request Feb 23, 2026
mreedr pushed a commit to mreedr/openclaw-custom that referenced this pull request Feb 24, 2026
margulans pushed a commit to margulans/Neiron-AI-assistant that referenced this pull request Feb 25, 2026
brianleach pushed a commit to brianleach/openclaw that referenced this pull request Feb 26, 2026
Yuyang-0 pushed a commit to Yuyang-0/openclaw that referenced this pull request Feb 26, 2026
mylukin pushed a commit to mylukin/openclaw that referenced this pull request Feb 26, 2026
r4jiv007 pushed a commit to r4jiv007/openclaw that referenced this pull request Feb 28, 2026
hughdidit pushed a commit to hughdidit/DAISy-Agency that referenced this pull request Mar 1, 2026
(cherry picked from commit 3c57bf4)

# Conflicts:
#	src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts
#	src/agents/pi-embedded-helpers/errors.ts
r4jiv007 pushed a commit to r4jiv007/openclaw that referenced this pull request Feb 28, 2026
hughdidit pushed a commit to hughdidit/DAISy-Agency that referenced this pull request Mar 1, 2026
(cherry picked from commit 3c57bf4)

# Conflicts:
#	src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts
#	src/agents/pi-embedded-helpers/errors.ts
hughdidit pushed a commit to hughdidit/DAISy-Agency that referenced this pull request Mar 3, 2026
(cherry picked from commit 3c57bf4)

# Conflicts:
#	src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts
#	src/agents/pi-embedded-helpers/errors.ts
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026

Labels

agents Agent runtime and tooling dedupe:parent Primary canonical item in dedupe cluster size: XS

Development

Successfully merging this pull request may close these issues.

Treat HTTP 503 (and 502/504) as failover-eligible so model fallback triggers

3 participants