fix(agents): classify Cloudflare/CDN HTML error pages as transport failures#67642
Conversation
Greptile SummaryFixes a real misclassification bug where CDN/Cloudflare HTML error pages (502, 503, 520–524) containing text like "Rate limit exceeded" were incorrectly classified as structured API rate-limit errors, causing wrong failover behavior and a stuck TUI. The fix adds an HTML short-circuit in Confidence Score: 5/5Safe to merge — the fix is correct, targeted, and well-tested for the primary regression scenarios. Both changed functions are correctly updated, the new upstream_html kind and user-facing message are properly placed, and the key regressions are all tested. The only gap is a missing test documenting the intentional behavior change for 403 HTML in classifyFailoverReason, which is P2 and does not block merging. No files require special attention. Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/agents/pi-embedded-helpers/provider-error-patterns.test.ts
Line: 164-169
Comment:
**Consider adding a `classifyFailoverReason` test for 403 HTML**
The HTML short-circuit in `classifyFailoverSignal` now fires for 403 HTML too, changing the failover classification from the old `"auth"` path to `"timeout"`. That behavior change is intentional (CDN 403 HTML = transport failure, not API auth failure), but it's not covered by any test here. Adding a case alongside the 502/503 tests would explicitly document the intent and prevent future regressions if the ordering of guards ever shifts.
```suggestion
it("classifies 403 HTML as timeout in failover signal (CDN block, not API auth)", () => {
const html403 =
"<!doctype html><html><head><title>403 Forbidden</title></head>" +
"<body><h1>Forbidden</h1></body></html>";
expect(classifyFailoverReason(`403 ${html403}`)).toBe("timeout");
});
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(agents): classify Cloudflare/CDN HTM..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f43127de8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
2f43127 to
51869c3
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 51869c3e8f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 195f1fbe5c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ilures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly).
195f1fb to
0060f57
Compare
0060f57 to
a456f98
Compare
obviyus
left a comment
There was a problem hiding this comment.
Verified the Cloudflare/CDN HTML failure path and confirmed the classifier now treats 5xx HTML pages as transport failures instead of misreading embedded body text as API rate limits.
Maintainer follow-up: preserved auth remediation for HTML 401/403, preserved proxy remediation for HTML 407, and fixed Error:-prefixed HTML status parsing while resolving the review-thread concerns.
Local gate: pnpm test src/agents/pi-embedded-helpers/provider-error-patterns.test.ts src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a456f98c91
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| isTransportHtmlErrorStatus(inferredStatus) && | ||
| isHtmlErrorResponse(signal.message, inferredStatus) | ||
| ) { |
There was a problem hiding this comment.
Restrict HTML timeout short-circuit to transient statuses
This new early return turns any HTML response except 401/403/407 into timeout, which now bypasses the explicit HTTP-status mapping in classifyFailoverClassificationFromHttpStatus (for example, 402 billing handling at errors.ts:556-558). A response like 402 <!doctype html>... previously classified as billing now becomes timeout, which changes failover behavior (e.g., retry-limit escalation in run/failover-policy.ts only treats non-timeout reasons as escalation candidates). Narrowing the short-circuit to transport-like statuses (5xx/408/499/etc.) would avoid regressing non-transient status handling.
Useful? React with 👍 / 👎.
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…hanks @stainlu) * fix(agents): classify Cloudflare/CDN HTML error pages as transport failures Fixes openclaw#67517 When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers would scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This caused incorrect failover behavior (profile rotation instead of clean retry/fallback) and left the TUI stuck. Two fixes: 1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly. 2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning "upstream_html" for non-403 statuses with a clear user-facing message about CDN/gateway errors. Adds regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML (still classified as auth), and JSON rate-limit responses (still classified correctly). * fix: preserve auth and proxy HTML classification * fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
Summary
rate_limitAPI error. This causes incorrect failover behavior (profile rotation instead of clean retry/fallback) and leaves the TUI stuck.classifyFailoverSignalruns text-pattern classifiers on raw error messages without first checking whether the message is an HTML page. HTML error pages from CDNs like Cloudflare often contain rate-limit or error keywords in their human-readable body text, which pattern matchers incorrectly classify as structured API errors. Additionally,classifyProviderRuntimeFailureKindonly checked for HTML responses on status 403 (auth_html_403), missing non-403 HTML pages entirely.classifyFailoverSignalnow short-circuits on HTML responses before running pattern matchers, returning"timeout"(transport failure) so retry/fallback handles them correctly.classifyProviderRuntimeFailureKindnow detects HTML errors at any status (not just 403), returning a new"upstream_html"kind for non-403 statuses with a clear user-facing message: "The provider returned an HTML error page instead of an API response."Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR