fix(web-fetch): detect response charset from Content-Type and HTML meta #8

Open
suboss87 wants to merge 58 commits into main from fix/web-fetch-charset-detection

@suboss87 suboss87 commented Apr 28, 2026

Owner

Summary

  • web_fetch decoded all HTTP response bodies as UTF-8 unconditionally, producing mojibake for legacy-charset pages (Shift_JIS, Big5, GBK, ISO-8859-1, etc.)
  • Root cause: readResponseText() used new TextDecoder() (UTF-8) in the streaming path and res.text() (also UTF-8 per WHATWG Fetch spec) in the non-streaming fallback -- neither respects the declared charset
  • Fix: collect raw bytes before decoding; resolve charset from the Content-Type: charset= parameter; if absent and content is HTML, scan the first 4 KB for a <meta charset> or <meta http-equiv="Content-Type" content="...charset=..."> declaration; decode with TextDecoder(detectedCharset), falling back to UTF-8 for unknown/missing labels
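
The resolution order described above can be sketched roughly as follows. This is an illustration only: `charsetFromContentType`, `charsetFromHtmlMeta`, and `decodeBody` are hypothetical names, and the actual helpers in `web-shared.ts` may be structured differently.

```typescript
// Hypothetical helper names; not taken from the PR's web-shared.ts.

/** Extract the charset parameter from a Content-Type header value, if present. */
function charsetFromContentType(contentType: string | null): string | undefined {
  if (!contentType) return undefined;
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType);
  return match ? match[1].toLowerCase() : undefined;
}

/** Scan the first 4 KB of an HTML body for a <meta> charset declaration. */
function charsetFromHtmlMeta(bytes: Uint8Array): string | undefined {
  // A single-byte decoder never fails and leaves ASCII markup intact,
  // so the declaration can be found before the real charset is known.
  const head = new TextDecoder("latin1").decode(bytes.subarray(0, 4096));
  // Matches both <meta charset="..."> and the http-equiv content="...charset=..." form.
  const match = /<meta[^>]*charset\s*=\s*["']?([^"'\s;/>]+)/i.exec(head);
  return match ? match[1].toLowerCase() : undefined;
}

/** Decode raw bytes: header charset first, then HTML meta, then UTF-8. */
function decodeBody(bytes: Uint8Array, contentType: string | null): string {
  const isHtml = /html/i.test(contentType ?? "");
  const label =
    charsetFromContentType(contentType) ??
    (isHtml ? charsetFromHtmlMeta(bytes) : undefined) ??
    "utf-8";
  try {
    return new TextDecoder(label).decode(bytes);
  } catch {
    // The TextDecoder constructor throws a RangeError for unknown labels.
    return new TextDecoder("utf-8").decode(bytes);
  }
}
```

For example, `decodeBody(new Uint8Array([0x63, 0x61, 0x66, 0xe9]), "text/html; charset=ISO-8859-1")` yields `café`, whereas decoding the same bytes as UTF-8 would end in a replacement character.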

Files changed

  • src/agents/tools/web-shared.ts -- charset helpers + reworked streaming and fallback decode paths
  • src/agents/tools/web-shared.charset.test.ts -- 7 new regression tests (all pass)

Test plan

  • 7 new tests in web-shared.charset.test.ts covering: Content-Type charset, HTML meta charset, http-equiv meta, UTF-8 fallback, non-HTML content, maxBytes truncation with charset
  • All 26 existing web-fetch tests still pass
  • pnpm check clean

Closes openclaw#72916


Generated by Claude Code



claude added 30 commits April 1, 2026 15:34
claude and others added 27 commits April 18, 2026 03:46
Before this fix, if onTimer rejected unexpectedly (e.g. a Node.js
internal error or GC pressure causing an exception in the finally
block's armTimer call), the .catch() handler only logged the error.
The scheduler chain was then permanently broken with no timer set,
silently halting all cron jobs until the next gateway restart.

Fix: call armTimer(state) inside the .catch() handler so a rare
unexpected rejection does not permanently stop the scheduler.

Regression test exercises the path by making nowMs() throw on the
4th call (inside the finally block's armTimer), which causes onTimer
to reject; the .catch() re-arm is then verified via state.timer.

Closes openclaw#73166.

https://claude.ai/code/session_01NHHoPHTrH4F9qFJBJHqjTk
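
A minimal sketch of the re-arm pattern this commit describes, under stated assumptions: `SchedulerState`, `handleTick`, the `delayMs` parameter, and the injectable `log` callback are hypothetical and stand in for the repository's real scheduler types. The normal path is assumed to re-arm inside `onTimer` itself, as the commit message says.

```typescript
// Hypothetical shapes; the real scheduler state and onTimer live elsewhere in the repo.
interface SchedulerState {
  timer: ReturnType<typeof setTimeout> | null;
}

type Tick = (state: SchedulerState) => Promise<void>;
type Log = (message: string, err: unknown) => void;

/** Schedule the next tick, replacing any timer that is already pending. */
function armTimer(state: SchedulerState, onTimer: Tick, delayMs: number, log: Log = console.error): void {
  if (state.timer !== null) clearTimeout(state.timer);
  state.timer = setTimeout(() => {
    state.timer = null;
    void handleTick(state, onTimer, delayMs, log);
  }, delayMs);
}

/** Run one tick; if it rejects, log and re-arm so the chain is not left without a timer. */
function handleTick(state: SchedulerState, onTimer: Tick, delayMs: number, log: Log = console.error): Promise<void> {
  // Assumes onTimer is async, so failures surface as rejections rather than sync throws.
  return onTimer(state).catch((err: unknown) => {
    log("cron: onTimer rejected; re-arming", err);
    armTimer(state, onTimer, delayMs, log);
  });
}
```

The point of the design is that the `.catch()` handler is the last line of defence: whatever went wrong inside `onTimer`, including a failure in its own re-arm step, a timer is pending again once the handler returns.
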
web_fetch decoded all HTTP response bodies as UTF-8 unconditionally.
readResponseText() used `new TextDecoder()` (UTF-8) in the streaming
path and `res.text()` (also UTF-8 per WHATWG Fetch spec) in the
fallback path, causing mojibake for legacy-charset pages such as
Shift_JIS, Big5, and ISO-8859-1.

Fix: in the streaming path, collect raw bytes before decoding and
resolve the charset from the Content-Type `charset=` parameter; if
absent and the response is HTML, scan the first 4 KB for a
`<meta charset>` or http-equiv declaration, then decode with
`TextDecoder(detectedCharset)`. The non-streaming fallback uses
`arrayBuffer()` + the same charset resolution for environments that
expose it, and retains the old `text()` path as a last resort.

Closes openclaw#72916.

https://claude.ai/code/session_01NHHoPHTrH4F9qFJBJHqjTk
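
The three decode paths named in this commit (streaming, `arrayBuffer()` fallback, `text()` as a last resort) might be laid out as below. This is a sketch, not the PR's code: `BodyLike`, `collectBytes`, and `readBodyText` are hypothetical names, the charset is assumed to be already resolved and valid, and applying `maxBytes` in the `arrayBuffer()` path is an assumption.

```typescript
// Hypothetical minimal response shape; not the PR's actual types.
interface BodyLike {
  body?: AsyncIterable<Uint8Array> | null;
  arrayBuffer?: () => Promise<ArrayBufferLike>;
  text: () => Promise<string>;
}

/** Concatenate streamed chunks into one buffer, keeping at most maxBytes. */
async function collectBytes(chunks: AsyncIterable<Uint8Array>, maxBytes: number): Promise<Uint8Array> {
  const parts: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of chunks) {
    const room = maxBytes - total;
    const part = chunk.byteLength > room ? chunk.subarray(0, room) : chunk;
    parts.push(part);
    total += part.byteLength;
    if (total >= maxBytes) break; // stop reading once the cap is reached
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.byteLength;
  }
  return out;
}

/** Decode raw bytes with the resolved charset; fall back to text() only when no bytes are available. */
async function readBodyText(res: BodyLike, charset: string, maxBytes: number): Promise<string> {
  const decode = (bytes: Uint8Array): string => new TextDecoder(charset).decode(bytes);
  if (res.body) {
    // Streaming path: decode once, after all bytes are collected.
    return decode(await collectBytes(res.body, maxBytes));
  }
  if (typeof res.arrayBuffer === "function") {
    // Non-streaming fallback: same decoding, applied to the whole buffer.
    const bytes = new Uint8Array(await res.arrayBuffer());
    return decode(bytes.subarray(0, maxBytes));
  }
  // Last resort: text() always decodes as UTF-8.
  return res.text();
}
```

Decoding after collection, rather than per chunk, keeps a multi-byte character that straddles two chunks intact. A character cut by the `maxBytes` cap itself may still decode to a replacement character at the end of the string.
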
devin-ai-integration[bot]

This comment was marked as resolved.


Development

Successfully merging this pull request may close these issues.

[Bug]: web_fetch returns mojibake for non-UTF-8 pages

2 participants