Skip to content

fix: decode web fetch legacy charsets#73513

Merged
amknight merged 2 commits into
mainfrom
ak/issue-72916-investigation
Apr 28, 2026
Merged

fix: decode web fetch legacy charsets#73513
amknight merged 2 commits into
mainfrom
ak/issue-72916-investigation

Conversation

@amknight

Copy link
Copy Markdown
Member

Summary

  • Problem: web_fetch decoded HTTP response bodies as UTF-8 unconditionally. Pages encoded with legacy charsets (Shift_JIS, Big5, GBK, ISO-8859-1, etc.) returned mojibake before extraction.
  • Why it matters: Readability and the basic HTML fallback never see the original characters, so agents read garbled content from any non-UTF-8 site.
  • What changed: Decode response bytes once, after charset detection. Order: Content-Type: charset=, BOM, then a bounded 4 KB scan of XML/HTML meta declarations for document-like content types; fall back to UTF-8.
  • What did NOT change (scope boundary): No new dependencies. Readability/extraction logic, Firecrawl/provider fallback, config schemas, docs, and unrelated cron/timer code are untouched.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Skills / tool execution

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: readResponseText() in src/agents/tools/web-shared.ts decoded chunks with default new TextDecoder() and the fallback used res.text(), both of which assume UTF-8 regardless of declared charset.
  • Missing detection / guardrail: No charset detection; no test for non-UTF-8 pages.
  • Contributing context (if known): Public SDK re-exports readResponseText, so fixing it centrally also benefits provider plugins that decode HTTP bodies for error detail.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
  • Target test or file: src/agents/tools/web-tools.fetch.test.ts
  • Scenario the test should lock in: a text/plain; charset=iso-8859-1 body decodes to café, and an HTML body with <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"> decodes Shift_JIS 日本語 before extraction.
  • Why this is the smallest reliable guardrail: exercises both charset paths through the real web_fetch tool composition.
  • Existing test that already covers this (if any): None.

User-visible / Behavior Changes

  • web_fetch now returns readable text for pages declared as Shift_JIS, Big5, GBK, ISO-8859-1, etc., instead of mojibake. UTF-8 pages are unchanged.

Diagram

Before:
fetch -> stream chunks -> TextDecoder(utf-8) per chunk -> mojibake -> Readability/basic extraction

After:
fetch -> capped raw bytes -> charset = ContentType ?? BOM ?? meta sniff (4KB) ?? utf-8 -> single TextDecoder(charset).decode(bytes) -> extraction

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS 26.4
  • Runtime/container: Node 22 (local pnpm worktree)
  • Model/provider: N/A (tool decode path)
  • Integration/channel (if any): N/A
  • Relevant config (redacted): default tools.web.fetch config

Steps

  1. web_fetch({ url: "http://www.aozora.gr.jp/cards/000081/files/46268_23911.html", extractMode: "text" }).
  2. Inspect returned text and title.
  3. Compare to expected Japanese title 宮澤賢治 星めぐりの歌.

Expected

  • Readable Japanese text matching the page.

Actual (before fix)

  • Mojibake / U+FFFD replacement characters; e.g. title �{�V���� ���߂���̉� and 414 replacement chars in body.

Evidence

  • Failing test/log before + passing after
pnpm test src/agents/tools/web-tools.fetch.test.ts
RUN v4.1.5
Test Files  1 passed (1)
Tests       16 passed (16)

Live repro confirmed against the issue's URL: UTF-8 decode produced 414 U+FFFD chars; Shift_JIS decode of the same bytes produced the expected Japanese title.

Human Verification (required)

  • Verified scenarios: Shift_JIS HTML page from issue URL decoded correctly; ISO-8859-1 text/plain decoded correctly; existing 14 web_fetch tests still pass; capped/endless stream still truncates.
  • Edge cases checked: missing charset in Content-Type; meta http-equiv charset; UTF-8 BOM; oversized response cap; non-document content types skip metadata sniff to avoid false positives.
  • What you did not verify: full suite locally; pnpm check:changed was not run because blacksmith Testbox CLI is not installed in this environment. CI should cover the changed lanes (core, coreTests).

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: Bogus <meta charset> could mislead detection on attacker-controlled HTML.
    • Mitigation: Sniffing is gated to document-like content types or missing content type; the value must match a strict label charset pattern (≤64 chars, [A-Za-z0-9._:-]); unsupported labels fall back to UTF-8 via try/catch around TextDecoder.
  • Risk: Extra byte buffer could increase memory.
    • Mitigation: Bounded by existing maxResponseBytes, with a single concat instead of per-chunk decode.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels Apr 28, 2026
@clawsweeper

clawsweeper Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Codex review: keeping this open for maintainer follow-up; there is still a little grit to resolve.

Keep this PR open. It has the protected maintainer label, current main still contains the UTF-8/default decode paths the PR is meant to replace, and the linked bug #72916 remains open. The PR is an active implementation candidate, not obsolete cleanup.

Best possible solution:

Review and land this PR, or an equivalent fix in src/agents/tools/web-shared.ts, with the proposed legacy-charset regression tests. Before merge, resolve the bounded-read/security review concern and clean up or intentionally accept the unrelated Discord changelog entry. After merge, #72916 can close through the linked PR relationship.

What I checked:

  • protected maintainer label: Live issue metadata reports this PR as open with labels agents, maintainer, and size: M; the protected maintainer label blocks automated closure even though live author association now reports CONTRIBUTOR.
  • current main stream decode still defaults to UTF-8: readResponseText constructs new TextDecoder() with no charset before decoding streamed chunks, matching the root cause described by the PR and [Bug]: web_fetch returns mojibake for non-UTF-8 pages #72916. (src/agents/tools/web-shared.ts:116, e4ff7c162044)
  • current main fallback still uses Response.text: The non-stream fallback still calls await res.text(), so current main retains the second default UTF-8 path called out by the PR body. (src/agents/tools/web-shared.ts:165, e4ff7c162044)
  • web_fetch uses the shared reader: Successful web_fetch responses are read through readResponseText(res, { maxBytes: params.maxResponseBytes }), so the shared decoding behavior affects user-visible fetch output before extraction. (src/agents/tools/web-fetch.ts:481, e4ff7c162044)
  • main lacks legacy charset regression coverage: Searches for Shift_JIS, iso-8859, Big5, GBK, mojibake, and charset in the relevant tool tests show only existing UTF-8 fixtures on main, not the PR's legacy-charset cases. (src/agents/tools/web-tools.fetch.test.ts:42, e4ff7c162044)
  • linked bug remains open and matches this fix: Live metadata for [Bug]: web_fetch returns mojibake for non-UTF-8 pages #72916 shows it is open and describes the same default TextDecoder() / res.text() mojibake problem for Shift_JIS and other non-UTF-8 pages.

Remaining risk / open question:

  • The PR changes decoding for attacker-controlled HTTP response bodies; reviewer focus should stay on charset-label validation, fallback behavior for unsupported labels, and preserving maxResponseBytes on every read path.
  • The diff includes an unrelated CHANGELOG.md entry for Discord fix: Discord read/search timeout, session-key fallback, and gateway execution mode #73521 in a charset-focused PR. That should be removed, split, or explicitly accepted before landing.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e4ff7c162044.

@amknight amknight marked this pull request as ready for review April 28, 2026 11:17
@greptile-apps

greptile-apps Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes web_fetch incorrectly decoding non-UTF-8 HTTP response bodies as UTF-8, causing mojibake for pages encoded with Shift_JIS, Big5, GBK, ISO-8859-1, and similar legacy charsets. The fix buffers raw response bytes and applies a detection cascade — Content-Type: charset=, BOM, then a bounded 4 KB meta-tag sniff — before a single TextDecoder decode, with a UTF-8 fallback. Two new tests cover the ISO-8859-1 and Shift_JIS paths explicitly.

Confidence Score: 5/5

Safe to merge — the change is a well-scoped, backward-compatible bug fix with no security concerns.

The implementation is logically correct: detection priority (HTTP header → BOM → meta sniff → UTF-8 fallback) matches the WHATWG/HTML spec intent; the module-level TextDecoder singletons are stateless across non-streaming calls so reuse is safe; normalizeCharset guards prevent accepting malformed or oversized labels; the try/catch around new TextDecoder(charset) correctly handles unsupported but syntactically valid labels; and concatBytes correctly tracks byte counts through truncation. The two new tests exercise both detection paths. No P0 or P1 issues were found.

No files require special attention.

Reviews (1): Last reviewed commit: "fix: decode web fetch legacy charsets" | Re-trigger Greptile

@amknight amknight force-pushed the ak/issue-72916-investigation branch from 68e9547 to d4822a8 Compare April 28, 2026 11:46
@aisle-research-bot

aisle-research-bot Bot commented Apr 28, 2026

Copy link
Copy Markdown

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# Severity Title
1 🟠 High readResponseText() ignores maxBytes when falling back to Response.arrayBuffer()/text(), enabling memory DoS
1. 🟠 readResponseText() ignores maxBytes when falling back to Response.arrayBuffer()/text(), enabling memory DoS
Property Value
Severity High
CWE CWE-400
Location src/agents/tools/web-shared.ts:271-288

Description

The readResponseText() helper enforces maxBytes only when res.body.getReader() is available. If the response body is not a WHATWG stream (no getReader), the function falls back to res.arrayBuffer() (and then res.text()), which reads the entire response into memory regardless of options.maxBytes.

This allows a remote server (attacker-controlled URL/content) to return an arbitrarily large response body that will be fully buffered in memory, defeating configured safeguards such as params.maxResponseBytes in web-fetch.ts.

Vulnerable code (maxBytes bypass in fallback):

const readBytes = (res as { arrayBuffer?: () => Promise<ArrayBuffer> }).arrayBuffer;
if (typeof readBytes === "function") {
  const bytes = new Uint8Array(await readBytes.call(res));
  return {
    text: decodeResponseBytes(res, bytes),
    truncated: false,
    bytesRead: bytes.byteLength,
  };
}

const text = await res.text();
return { text, truncated: false, bytesRead: text.length };

Impact:

  • Memory/CPU denial of service when fetching from untrusted URLs in environments/polyfills/mocks where body.getReader is absent but arrayBuffer() (or text()) is present.
  • truncated is incorrectly reported as false, so downstream code won’t warn that the safety cap was exceeded.

Recommendation

Enforce maxBytes in all code paths.

Option A (preferred): when maxBytes is set and getReader() is unavailable, avoid arrayBuffer()/text() and instead fail fast or use a capped read from a Node stream.

Option B: if you must use arrayBuffer()/text(), cap the amount retained/decoded and correctly mark truncation.

Example fix (capped arrayBuffer path):

const readBytes = (res as { arrayBuffer?: () => Promise<ArrayBuffer> }).arrayBuffer;
if (typeof readBytes === "function") {
  const buf = new Uint8Array(await readBytes.call(res));
  if (maxBytes && buf.byteLength > maxBytes) {
    const capped = buf.subarray(0, maxBytes);
    return {
      text: decodeResponseBytes(res, capped),
      truncated: true,
      bytesRead: maxBytes,
    };
  }
  return {
    text: decodeResponseBytes(res, buf),
    truncated: false,
    bytesRead: buf.byteLength,
  };
}// Similarly cap res.text() when maxBytes is set, or avoid calling it for unbounded reads.

Also consider propagating an explicit error when size exceeds maxBytes so callers can abort processing safely.


Analyzed PR: #73513 at commit a6943a0

Last updated on: 2026-04-28T11:55:49Z

@amknight amknight merged commit 7a23b2d into main Apr 28, 2026
67 of 69 checks passed
@amknight amknight deleted the ak/issue-72916-investigation branch April 28, 2026 12:09
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
* fix: decode web fetch legacy charsets
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
* fix: decode web fetch legacy charsets
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
* fix: decode web fetch legacy charsets
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
* fix: decode web fetch legacy charsets
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: web_fetch returns mojibake for non-UTF-8 pages

1 participant