fix: decode web fetch legacy charsets#73513
Conversation
|
Codex review: keeping this open for maintainer follow-up; there is still a little grit to resolve. Keep this PR open. It has the protected Best possible solution: Review and land this PR, or an equivalent fix in What I checked:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e4ff7c162044. |
Greptile SummaryThis PR fixes Confidence Score: 5/5Safe to merge — the change is a well-scoped, backward-compatible bug fix with no security concerns. The implementation is logically correct: detection priority (HTTP header → BOM → meta sniff → UTF-8 fallback) matches the WHATWG/HTML spec intent; the module-level TextDecoder singletons are stateless across non-streaming calls so reuse is safe; normalizeCharset guards prevent accepting malformed or oversized labels; the try/catch around new TextDecoder(charset) correctly handles unsupported but syntactically valid labels; and concatBytes correctly tracks byte counts through truncation. The two new tests exercise both detection paths. No P0 or P1 issues were found. No files require special attention. Reviews (1): Last reviewed commit: "fix: decode web fetch legacy charsets" | Re-trigger Greptile |
68e9547 to
d4822a8
Compare
🔒 Aisle Security AnalysisWe found 1 potential security issue(s) in this PR:
1. 🟠 readResponseText() ignores maxBytes when falling back to Response.arrayBuffer()/text(), enabling memory DoS
DescriptionThe This allows a remote server (attacker-controlled URL/content) to return an arbitrarily large response body that will be fully buffered in memory, defeating configured safeguards such as Vulnerable code (maxBytes bypass in fallback): const readBytes = (res as { arrayBuffer?: () => Promise<ArrayBuffer> }).arrayBuffer;
if (typeof readBytes === "function") {
const bytes = new Uint8Array(await readBytes.call(res));
return {
text: decodeResponseBytes(res, bytes),
truncated: false,
bytesRead: bytes.byteLength,
};
}
const text = await res.text();
return { text, truncated: false, bytesRead: text.length };Impact:
RecommendationEnforce Option A (preferred): when Option B: if you must use Example fix (capped arrayBuffer path): const readBytes = (res as { arrayBuffer?: () => Promise<ArrayBuffer> }).arrayBuffer;
if (typeof readBytes === "function") {
const buf = new Uint8Array(await readBytes.call(res));
if (maxBytes && buf.byteLength > maxBytes) {
const capped = buf.subarray(0, maxBytes);
return {
text: decodeResponseBytes(res, capped),
truncated: true,
bytesRead: maxBytes,
};
}
return {
text: decodeResponseBytes(res, buf),
truncated: false,
bytesRead: buf.byteLength,
};
}
// Similarly cap res.text() when maxBytes is set, or avoid calling it for unbounded reads.Also consider propagating an explicit error when size exceeds Analyzed PR: #73513 at commit Last updated on: 2026-04-28T11:55:49Z |
* fix: decode web fetch legacy charsets
* fix: decode web fetch legacy charsets
* fix: decode web fetch legacy charsets
* fix: decode web fetch legacy charsets
Summary
web_fetchdecoded HTTP response bodies as UTF-8 unconditionally. Pages encoded with legacy charsets (Shift_JIS, Big5, GBK, ISO-8859-1, etc.) returned mojibake before extraction.Content-Type: charset=, BOM, then a bounded 4 KB scan of XML/HTML meta declarations for document-like content types; fall back to UTF-8.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause (if applicable)
readResponseText()insrc/agents/tools/web-shared.tsdecoded chunks with defaultnew TextDecoder()and the fallback usedres.text(), both of which assume UTF-8 regardless of declared charset.readResponseText, so fixing it centrally also benefits provider plugins that decode HTTP bodies for error detail.Regression Test Plan (if applicable)
src/agents/tools/web-tools.fetch.test.tstext/plain; charset=iso-8859-1body decodes tocafé, and an HTML body with<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">decodes Shift_JIS日本語before extraction.web_fetchtool composition.User-visible / Behavior Changes
web_fetchnow returns readable text for pages declared as Shift_JIS, Big5, GBK, ISO-8859-1, etc., instead of mojibake. UTF-8 pages are unchanged.Diagram
Security Impact (required)
NoNoNoNoNoRepro + Verification
Environment
tools.web.fetchconfigSteps
web_fetch({ url: "http://www.aozora.gr.jp/cards/000081/files/46268_23911.html", extractMode: "text" }).textandtitle.宮澤賢治 星めぐりの歌.Expected
Actual (before fix)
�{�V���� ���߂���̉�and 414 replacement chars in body.Evidence
Live repro confirmed against the issue's URL: UTF-8 decode produced 414 U+FFFD chars; Shift_JIS decode of the same bytes produced the expected Japanese title.
Human Verification (required)
text/plaindecoded correctly; existing 14 web_fetch tests still pass; capped/endless stream still truncates.charsetinContent-Type; metahttp-equivcharset; UTF-8 BOM; oversized response cap; non-document content types skip metadata sniff to avoid false positives.pnpm check:changedwas not run becauseblacksmithTestbox CLI is not installed in this environment. CI should cover the changed lanes (core,coreTests).Review Conversations
Compatibility / Migration
YesNoNoRisks and Mitigations
<meta charset>could mislead detection on attacker-controlled HTML.[A-Za-z0-9._:-]); unsupported labels fall back to UTF-8 viatry/catcharoundTextDecoder.maxResponseBytes, with a single concat instead of per-chunk decode.