fix(agents): classify Cloudflare/CDN HTML error pages as transport failures#67642

Merged
obviyus merged 3 commits into
openclaw:mainfrom
stainlu:fix/issue-67517-cloudflare-html-misclassification
Apr 16, 2026
Conversation

@stainlu

@stainlu stainlu commented Apr 16, 2026

Contributor

Summary

  • Problem: When a provider endpoint returns an HTML error page (e.g. Cloudflare 502/503/520-524), the pattern-based message classifiers scan the HTML body and misinterpret embedded text like "Rate limit exceeded" as a structured rate_limit API error. This causes incorrect failover behavior (profile rotation instead of clean retry/fallback) and leaves the TUI stuck.
  • Root cause: classifyFailoverSignal runs text-pattern classifiers on raw error messages without first checking whether the message is an HTML page. HTML error pages from CDNs like Cloudflare often contain rate-limit or error keywords in their human-readable body text, which pattern matchers incorrectly classify as structured API errors. Additionally, classifyProviderRuntimeFailureKind only checked for HTML responses on status 403 (auth_html_403), missing non-403 HTML pages entirely.
  • Fix:
    1. classifyFailoverSignal now short-circuits on HTML responses before running pattern matchers, returning "timeout" (transport failure) so retry/fallback handles them correctly.
    2. classifyProviderRuntimeFailureKind now detects HTML errors at any status (not just 403), returning a new "upstream_html" kind for non-403 statuses with a clear user-facing message: "The provider returned an HTML error page instead of an API response."
    3. Regression tests covering Cloudflare 502/503 HTML with embedded rate-limit text, 403 HTML preservation, and JSON rate-limit correctness.
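The short-circuit described in fix 1 can be sketched roughly as follows. This is a minimal illustration, not the code in `errors.ts`: the names `looksLikeHtmlPage`, `classifyByPatterns`, and `classifySketch` are hypothetical, and the HTML heuristic (optional leading status code followed by a doctype or `<html>` tag) is an assumption about how such a check might work.

```typescript
// Hypothetical names throughout; the real helpers in
// src/agents/pi-embedded-helpers/errors.ts may look quite different.
type FailoverReason = "timeout" | "rate_limit" | "auth" | "unknown";

// Assumed heuristic: after an optional leading 3-digit status code,
// an HTML page starts with a doctype declaration or an <html> tag.
function looksLikeHtmlPage(message: string): boolean {
  const body = message.trim().replace(/^\d{3}\s+/, "");
  return /^<!doctype html|^<html[\s>]/i.test(body);
}

// Stand-in for the text-pattern classifiers: they match keywords
// anywhere in the message, which is what misfires on HTML bodies.
function classifyByPatterns(message: string): FailoverReason {
  if (/rate limit/i.test(message)) return "rate_limit";
  if (/unauthorized|invalid api key/i.test(message)) return "auth";
  return "unknown";
}

// The guard runs first, so the pattern matchers never see HTML text.
function classifySketch(message: string): FailoverReason {
  if (looksLikeHtmlPage(message)) return "timeout";
  return classifyByPatterns(message);
}
```

With this ordering, a Cloudflare-style page containing "Rate limit exceeded" is treated as a transport failure, while a JSON rate-limit body still reaches the pattern matchers unchanged.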

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

@openclaw-barnacle (bot) added the labels "agents" (Agent runtime and tooling) and "size: S" on Apr 16, 2026
@greptile-apps

greptile-apps Bot commented Apr 16, 2026

Contributor

Greptile Summary

Fixes a real misclassification bug where CDN/Cloudflare HTML error pages (502, 503, 520–524) containing text like "Rate limit exceeded" were incorrectly classified as structured API rate-limit errors, causing wrong failover behavior and a stuck TUI. The fix adds an HTML short-circuit in classifyFailoverSignal before pattern matchers run, and generalizes the HTML detection in classifyProviderRuntimeFailureKind to cover non-403 statuses with a new "upstream_html" kind and clear user-facing message. Regression tests are included for the key scenarios.
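As a rough illustration of the generalized HTML detection summarized above, the sketch below maps an HTML body to `"auth_html_403"` on status 403 and to `"upstream_html"` on any other status. Only the two kind names and the user-facing message come from the PR description; the function names, the `"other"` kind, and the HTML heuristic are assumptions made for the example.

```typescript
// "auth_html_403" and "upstream_html" are named in the PR description;
// "other" and every function name here are hypothetical.
type RuntimeFailureKind = "auth_html_403" | "upstream_html" | "other";

// Assumed heuristic for recognizing an HTML page body.
function isHtmlBody(body: string): boolean {
  return /^<!doctype html|^<html[\s>]/i.test(body.trim());
}

function classifyKindSketch(
  status: number | undefined,
  body: string,
): RuntimeFailureKind {
  if (!isHtmlBody(body)) return "other";
  // 403 keeps its existing auth-specific kind; every other status
  // carrying an HTML page is reported as an upstream HTML error.
  return status === 403 ? "auth_html_403" : "upstream_html";
}

function userMessageSketch(kind: RuntimeFailureKind): string | undefined {
  if (kind === "upstream_html") {
    return "The provider returned an HTML error page instead of an API response.";
  }
  return undefined;
}
```

Keeping the 403 branch separate means existing auth remediation text is unaffected, while 502/503/520-524 pages get the new message instead of whatever keyword happened to appear in the page body.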

Confidence Score: 5/5

Safe to merge — the fix is correct, targeted, and well-tested for the primary regression scenarios.

Both changed functions are correctly updated, the new upstream_html kind and user-facing message are properly placed, and the key regressions are all tested. The only gap is a missing test documenting the intentional behavior change for 403 HTML in classifyFailoverReason, which is P2 and does not block merging.

No files require special attention.

Path: src/agents/pi-embedded-helpers/provider-error-patterns.test.ts
Line: 164-169

Comment:
**Consider adding a `classifyFailoverReason` test for 403 HTML**

The HTML short-circuit in `classifyFailoverSignal` now fires for 403 HTML too, changing the failover classification from the old `"auth"` path to `"timeout"`. That behavior change is intentional (CDN 403 HTML = transport failure, not API auth failure), but it's not covered by any test here. Adding a case alongside the 502/503 tests would explicitly document the intent and prevent future regressions if the ordering of guards ever shifts.

```suggestion
  it("classifies 403 HTML as timeout in failover signal (CDN block, not API auth)", () => {
    const html403 =
      "<!doctype html><html><head><title>403 Forbidden</title></head>" +
      "<body><h1>Forbidden</h1></body></html>";
    expect(classifyFailoverReason(`403 ${html403}`)).toBe("timeout");
  });
```


Reviews (1): Last reviewed commit: "fix(agents): classify Cloudflare/CDN HTM..." | Re-trigger Greptile

Comment thread src/agents/pi-embedded-helpers/provider-error-patterns.test.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f43127de8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/pi-embedded-helpers/errors.ts Outdated
Comment thread src/agents/pi-embedded-helpers/errors.ts Outdated
@stainlu stainlu force-pushed the fix/issue-67517-cloudflare-html-misclassification branch from 2f43127 to 51869c3 Compare April 16, 2026 11:43

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51869c3e8f

Comment thread src/agents/pi-embedded-helpers/errors.ts Outdated
Comment thread src/agents/pi-embedded-helpers/errors.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 195f1fbe5c

Comment thread src/agents/pi-embedded-helpers/errors.ts
@obviyus obviyus self-assigned this Apr 16, 2026
stainlu and others added 2 commits April 16, 2026 18:12
…ilures

Fixes openclaw#67517

When a provider endpoint returns an HTML error page (e.g. Cloudflare
502/503/520-524), the pattern-based message classifiers would scan
the HTML body and misinterpret embedded text like "Rate limit
exceeded" as a structured rate_limit API error. This caused
incorrect failover behavior (profile rotation instead of clean
retry/fallback) and left the TUI stuck.

Two fixes:
1. classifyFailoverSignal now short-circuits on HTML responses
   before running pattern matchers, returning "timeout" (transport
   failure) so retry/fallback handles them correctly.
2. classifyProviderRuntimeFailureKind now detects HTML errors at
   any status (not just 403), returning "upstream_html" for
   non-403 statuses with a clear user-facing message about
   CDN/gateway errors.

Adds regression tests covering Cloudflare 502/503 HTML with
embedded rate-limit text, 403 HTML (still classified as auth),
and JSON rate-limit responses (still classified correctly).
@obviyus obviyus force-pushed the fix/issue-67517-cloudflare-html-misclassification branch from 195f1fb to 0060f57 Compare April 16, 2026 12:44
@obviyus obviyus force-pushed the fix/issue-67517-cloudflare-html-misclassification branch from 0060f57 to a456f98 Compare April 16, 2026 12:48

@obviyus obviyus left a comment

Contributor

Verified the Cloudflare/CDN HTML failure path and confirmed the classifier now treats 5xx HTML pages as transport failures instead of misreading embedded body text as API rate limits.

Maintainer follow-up: preserved auth remediation for HTML 401/403, preserved proxy remediation for HTML 407, and fixed Error:-prefixed HTML status parsing while resolving the review-thread concerns.
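The follow-up behavior can be pictured with a small sketch. It assumes the status is inferred from a leading three-digit code that may be preceded by an `Error:` prefix, and that HTML pages are then routed by status: 401/403 to auth remediation, 407 to proxy remediation, everything else to the transport path. All names below are hypothetical and the actual parsing in `errors.ts` may differ.

```typescript
// Hypothetical parser: accepts "502 <html>..." and "Error: 502 <html>...".
function inferLeadingStatus(message: string): number | undefined {
  const match = /^(?:Error:\s*)?(\d{3})\b/i.exec(message.trim());
  return match ? Number(match[1]) : undefined;
}

// Hypothetical routing categories for an HTML error page.
type HtmlRemediation = "auth" | "proxy" | "transport";

function remediationForHtmlStatus(status: number | undefined): HtmlRemediation {
  if (status === 401 || status === 403) return "auth"; // keep auth guidance
  if (status === 407) return "proxy"; // keep proxy guidance
  return "transport"; // retry/fallback path
}
```

Without the optional `Error:` group, a message such as `Error: 403 <html>...` would yield no status and fall into the transport bucket, which is the kind of misrouting the prefix handling is meant to avoid.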

Local gate: pnpm test src/agents/pi-embedded-helpers/provider-error-patterns.test.ts src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts src/agents/pi-embedded-helpers.formatassistanterrortext.test.ts.

@obviyus obviyus merged commit e588e90 into openclaw:main Apr 16, 2026
9 checks passed
@obviyus

obviyus commented Apr 16, 2026

Contributor

Landed on main.

Thanks @stainlu.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a456f98c91

Comment on lines +774 to +776
isTransportHtmlErrorStatus(inferredStatus) &&
isHtmlErrorResponse(signal.message, inferredStatus)
) {


P2 Badge Restrict HTML timeout short-circuit to transient statuses

This new early return turns any HTML response except 401/403/407 into timeout, which now bypasses the explicit HTTP-status mapping in classifyFailoverClassificationFromHttpStatus (for example, 402 billing handling at errors.ts:556-558). A response like 402 <!doctype html>... previously classified as billing now becomes timeout, which changes failover behavior (e.g., retry-limit escalation in run/failover-policy.ts only treats non-timeout reasons as escalation candidates). Narrowing the short-circuit to transport-like statuses (5xx/408/499/etc.) would avoid regressing non-transient status handling.

xudaiyanzi pushed a commit to xudaiyanzi/openclaw that referenced this pull request Apr 17, 2026
…hanks @stainlu)

* fix(agents): classify Cloudflare/CDN HTML error pages as transport failures

Fixes openclaw#67517

When a provider endpoint returns an HTML error page (e.g. Cloudflare
502/503/520-524), the pattern-based message classifiers would scan
the HTML body and misinterpret embedded text like "Rate limit
exceeded" as a structured rate_limit API error. This caused
incorrect failover behavior (profile rotation instead of clean
retry/fallback) and left the TUI stuck.

Two fixes:
1. classifyFailoverSignal now short-circuits on HTML responses
   before running pattern matchers, returning "timeout" (transport
   failure) so retry/fallback handles them correctly.
2. classifyProviderRuntimeFailureKind now detects HTML errors at
   any status (not just 403), returning "upstream_html" for
   non-403 statuses with a clear user-facing message about
   CDN/gateway errors.

Adds regression tests covering Cloudflare 502/503 HTML with
embedded rate-limit text, 403 HTML (still classified as auth),
and JSON rate-limit responses (still classified correctly).

* fix: preserve auth and proxy HTML classification

* fix: classify HTML provider error pages correctly (openclaw#67642) (thanks @stainlu)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
kvnkho pushed a commit to kvnkho/openclaw that referenced this pull request Apr 17, 2026
Mquarmoc pushed a commit to Mquarmoc/openclaw that referenced this pull request Apr 20, 2026
lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
globalcaos pushed a commit to globalcaos/tinkerclaw that referenced this pull request May 13, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
Labels

agents Agent runtime and tooling size: S

Development

Successfully merging this pull request may close these issues.

[Bug]: openai-codex/gpt-5.4 returns Cloudflare HTML and gets misclassified as rate_limit / DNS, leaving TUI stuck

2 participants