fix: decode web fetch legacy charsets by amknight · Pull Request #73513 · openclaw/openclaw

amknight · 2026-04-28T11:06:08Z

Summary

Problem: web_fetch decoded HTTP response bodies as UTF-8 unconditionally. Pages encoded with legacy charsets (Shift_JIS, Big5, GBK, ISO-8859-1, etc.) returned mojibake before extraction.
Why it matters: Readability and the basic HTML fallback never see the original characters, so agents read garbled content from any non-UTF-8 site.
What changed: Decode response bytes once, after charset detection. Order: Content-Type: charset=, BOM, then a bounded 4 KB scan of XML/HTML meta declarations for document-like content types; fall back to UTF-8.
What did NOT change (scope boundary): No new dependencies. Readability/extraction logic, Firecrawl/provider fallback, config schemas, docs, and unrelated cron/timer code are untouched.

Change Type (select all)

Bug fix

Scope (select all touched areas)

Skills / tool execution

Linked Issue/PR

Closes [Bug]: web_fetch returns mojibake for non-UTF-8 pages #72916
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: readResponseText() in src/agents/tools/web-shared.ts decoded chunks with default new TextDecoder() and the fallback used res.text(), both of which assume UTF-8 regardless of declared charset.
Missing detection / guardrail: No charset detection; no test for non-UTF-8 pages.
Contributing context (if known): Public SDK re-exports readResponseText, so fixing it centrally also benefits provider plugins that decode HTTP bodies for error detail.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
Target test or file: src/agents/tools/web-tools.fetch.test.ts
Scenario the test should lock in: a text/plain; charset=iso-8859-1 body decodes to café, and an HTML body with <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"> decodes Shift_JIS 日本語 before extraction.
Why this is the smallest reliable guardrail: exercises both charset paths through the real web_fetch tool composition.
Existing test that already covers this (if any): None.

User-visible / Behavior Changes

web_fetch now returns readable text for pages declared as Shift_JIS, Big5, GBK, ISO-8859-1, etc., instead of mojibake. UTF-8 pages are unchanged.

Diagram

Before:
fetch -> stream chunks -> TextDecoder(utf-8) per chunk -> mojibake -> Readability/basic extraction

After:
fetch -> capped raw bytes -> charset = ContentType ?? BOM ?? meta sniff (4KB) ?? utf-8 -> single TextDecoder(charset).decode(bytes) -> extraction

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: macOS 26.4
Runtime/container: Node 22 (local pnpm worktree)
Model/provider: N/A (tool decode path)
Integration/channel (if any): N/A
Relevant config (redacted): default tools.web.fetch config

Steps

web_fetch({ url: "http://www.aozora.gr.jp/cards/000081/files/46268_23911.html", extractMode: "text" }).
Inspect returned text and title.
Compare to expected Japanese title 宮澤賢治星めぐりの歌.

Expected

Readable Japanese text matching the page.

Actual (before fix)

Mojibake / U+FFFD replacement characters; e.g. title �{�V�� ߂��̉� and 414 replacement chars in body.

Evidence

Failing test/log before + passing after

pnpm test src/agents/tools/web-tools.fetch.test.ts
RUN v4.1.5
Test Files  1 passed (1)
Tests       16 passed (16)

Live repro confirmed against the issue's URL: UTF-8 decode produced 414 U+FFFD chars; Shift_JIS decode of the same bytes produced the expected Japanese title.

Human Verification (required)

Verified scenarios: Shift_JIS HTML page from issue URL decoded correctly; ISO-8859-1 text/plain decoded correctly; existing 14 web_fetch tests still pass; capped/endless stream still truncates.
Edge cases checked: missing charset in Content-Type; meta http-equiv charset; UTF-8 BOM; oversized response cap; non-document content types skip metadata sniff to avoid false positives.
What you did not verify: full suite locally; pnpm check:changed was not run because blacksmith Testbox CLI is not installed in this environment. CI should cover the changed lanes (core, coreTests).

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Risks and Mitigations

Risk: Bogus <meta charset> could mislead detection on attacker-controlled HTML.
- Mitigation: Sniffing is gated to document-like content types or missing content type; the value must match a strict label charset pattern (≤64 chars, [A-Za-z0-9._:-]); unsupported labels fall back to UTF-8 via try/catch around TextDecoder.
Risk: Extra byte buffer could increase memory.
- Mitigation: Bounded by existing maxResponseBytes, with a single concat instead of per-chunk decode.

clawsweeper · 2026-04-28T11:09:32Z

Codex review: keeping this open for maintainer follow-up; there is still a little grit to resolve.

Keep this PR open. It has the protected maintainer label, current main still contains the UTF-8/default decode paths the PR is meant to replace, and the linked bug #72916 remains open. The PR is an active implementation candidate, not obsolete cleanup.

Best possible solution:

Review and land this PR, or an equivalent fix in src/agents/tools/web-shared.ts, with the proposed legacy-charset regression tests. Before merge, resolve the bounded-read/security review concern and clean up or intentionally accept the unrelated Discord changelog entry. After merge, #72916 can close through the linked PR relationship.

What I checked:

protected maintainer label: Live issue metadata reports this PR as open with labels agents, maintainer, and size: M; the protected maintainer label blocks automated closure even though live author association now reports CONTRIBUTOR.
current main stream decode still defaults to UTF-8: readResponseText constructs new TextDecoder() with no charset before decoding streamed chunks, matching the root cause described by the PR and [Bug]: web_fetch returns mojibake for non-UTF-8 pages #72916. (src/agents/tools/web-shared.ts:116, e4ff7c162044)
current main fallback still uses Response.text: The non-stream fallback still calls await res.text(), so current main retains the second default UTF-8 path called out by the PR body. (src/agents/tools/web-shared.ts:165, e4ff7c162044)
web_fetch uses the shared reader: Successful web_fetch responses are read through readResponseText(res, { maxBytes: params.maxResponseBytes }), so the shared decoding behavior affects user-visible fetch output before extraction. (src/agents/tools/web-fetch.ts:481, e4ff7c162044)
main lacks legacy charset regression coverage: Searches for Shift_JIS, iso-8859, Big5, GBK, mojibake, and charset in the relevant tool tests show only existing UTF-8 fixtures on main, not the PR's legacy-charset cases. (src/agents/tools/web-tools.fetch.test.ts:42, e4ff7c162044)
linked bug remains open and matches this fix: Live metadata for [Bug]: web_fetch returns mojibake for non-UTF-8 pages #72916 shows it is open and describes the same default TextDecoder() / res.text() mojibake problem for Shift_JIS and other non-UTF-8 pages.

Remaining risk / open question:

The PR changes decoding for attacker-controlled HTTP response bodies; reviewer focus should stay on charset-label validation, fallback behavior for unsupported labels, and preserving maxResponseBytes on every read path.
The diff includes an unrelated CHANGELOG.md entry for Discord fix: Discord read/search timeout, session-key fallback, and gateway execution mode #73521 in a charset-focused PR. That should be removed, split, or explicitly accepted before landing.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e4ff7c162044.

greptile-apps · 2026-04-28T11:21:00Z

Greptile Summary

This PR fixes web_fetch incorrectly decoding non-UTF-8 HTTP response bodies as UTF-8, causing mojibake for pages encoded with Shift_JIS, Big5, GBK, ISO-8859-1, and similar legacy charsets. The fix buffers raw response bytes and applies a detection cascade — Content-Type: charset=, BOM, then a bounded 4 KB meta-tag sniff — before a single TextDecoder decode, with a UTF-8 fallback. Two new tests cover the ISO-8859-1 and Shift_JIS paths explicitly.

Confidence Score: 5/5

Safe to merge — the change is a well-scoped, backward-compatible bug fix with no security concerns.

The implementation is logically correct: detection priority (HTTP header → BOM → meta sniff → UTF-8 fallback) matches the WHATWG/HTML spec intent; the module-level TextDecoder singletons are stateless across non-streaming calls so reuse is safe; normalizeCharset guards prevent accepting malformed or oversized labels; the try/catch around new TextDecoder(charset) correctly handles unsupported but syntactically valid labels; and concatBytes correctly tracks byte counts through truncation. The two new tests exercise both detection paths. No P0 or P1 issues were found.

No files require special attention.

_{Reviews (1): Last reviewed commit: "fix: decode web fetch legacy charsets" | Re-trigger Greptile}

aisle-research-bot · 2026-04-28T11:54:06Z

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

#	Severity	Title
1	🟠 High	readResponseText() ignores maxBytes when falling back to Response.arrayBuffer()/text(), enabling memory DoS

1. 🟠 readResponseText() ignores maxBytes when falling back to Response.arrayBuffer()/text(), enabling memory DoS

Property	Value
Severity	High
CWE	CWE-400
Location	`src/agents/tools/web-shared.ts:271-288`

Description

The readResponseText() helper enforces maxBytes only when res.body.getReader() is available. If the response body is not a WHATWG stream (no getReader), the function falls back to res.arrayBuffer() (and then res.text()), which reads the entire response into memory regardless of options.maxBytes.

This allows a remote server (attacker-controlled URL/content) to return an arbitrarily large response body that will be fully buffered in memory, defeating configured safeguards such as params.maxResponseBytes in web-fetch.ts.

Vulnerable code (maxBytes bypass in fallback):

const readBytes = (res as { arrayBuffer?: () => Promise<ArrayBuffer> }).arrayBuffer;
if (typeof readBytes === "function") {
  const bytes = new Uint8Array(await readBytes.call(res));
  return {
    text: decodeResponseBytes(res, bytes),
    truncated: false,
    bytesRead: bytes.byteLength,
  };
}

const text = await res.text();
return { text, truncated: false, bytesRead: text.length };

Impact:

Memory/CPU denial of service when fetching from untrusted URLs in environments/polyfills/mocks where body.getReader is absent but arrayBuffer() (or text()) is present.
truncated is incorrectly reported as false, so downstream code won’t warn that the safety cap was exceeded.

Recommendation

Enforce maxBytes in all code paths.

Option A (preferred): when maxBytes is set and getReader() is unavailable, avoid arrayBuffer()/text() and instead fail fast or use a capped read from a Node stream.

Option B: if you must use arrayBuffer()/text(), cap the amount retained/decoded and correctly mark truncation.

Example fix (capped arrayBuffer path):

const readBytes = (res as { arrayBuffer?: () => Promise<ArrayBuffer> }).arrayBuffer;
if (typeof readBytes === "function") {
  const buf = new Uint8Array(await readBytes.call(res));
  if (maxBytes && buf.byteLength > maxBytes) {
    const capped = buf.subarray(0, maxBytes);
    return {
      text: decodeResponseBytes(res, capped),
      truncated: true,
      bytesRead: maxBytes,
    };
  }
  return {
    text: decodeResponseBytes(res, buf),
    truncated: false,
    bytesRead: buf.byteLength,
  };
}

// Similarly cap res.text() when maxBytes is set, or avoid calling it for unbounded reads.

Also consider propagating an explicit error when size exceeds maxBytes so callers can abort processing safely.

Analyzed PR: #73513 at commit a6943a0

_{Last updated on: 2026-04-28T11:55:49Z}

* fix: decode web fetch legacy charsets

openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels Apr 28, 2026

amknight marked this pull request as ready for review April 28, 2026 11:17

fix: decode web fetch legacy charsets

d4822a8

amknight force-pushed the ak/issue-72916-investigation branch from 68e9547 to d4822a8 Compare April 28, 2026 11:46

openclaw-barnacle Bot added size: M and removed size: S labels Apr 28, 2026

docs: add discord timeout changelog entry

a6943a0

amknight merged commit 7a23b2d into main Apr 28, 2026
67 of 69 checks passed

amknight deleted the ak/issue-72916-investigation branch April 28, 2026 12:09

github-actions Bot mentioned this pull request Apr 28, 2026

📡 Upstream Digest — 2026-04-28 16:04 UTC curtismercier/openclaw-mods#708

Open

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026

fix: decode web fetch legacy charsets (openclaw#73513)

1e11300

* fix: decode web fetch legacy charsets

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026

fix: decode web fetch legacy charsets (openclaw#73513)

aa98e41

* fix: decode web fetch legacy charsets

jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026

fix: decode web fetch legacy charsets (openclaw#73513)

a2cd34b

* fix: decode web fetch legacy charsets

sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026

fix: decode web fetch legacy charsets (openclaw#73513)

2b1afa8

* fix: decode web fetch legacy charsets

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: decode web fetch legacy charsets#73513

fix: decode web fetch legacy charsets#73513
amknight merged 2 commits into
mainfrom
ak/issue-72916-investigation

amknight commented Apr 28, 2026

Uh oh!

clawsweeper Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Uh oh!

aisle-research-bot Bot commented Apr 28, 2026 •

edited

Loading

Description

Recommendation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

amknight commented Apr 28, 2026

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Root Cause (if applicable)

Regression Test Plan (if applicable)

User-visible / Behavior Changes

Diagram

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual (before fix)

Evidence

Human Verification (required)

Review Conversations

Compatibility / Migration

Risks and Mitigations

Uh oh!

clawsweeper Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

aisle-research-bot Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔒 Aisle Security Analysis

Description

Recommendation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

clawsweeper Bot commented Apr 28, 2026 •

edited

Loading

aisle-research-bot Bot commented Apr 28, 2026 •

edited

Loading