Skip to content

handle whatwg encoding standard overrides#6091

Merged
anonrig merged 3 commits intomainfrom
yagiz/fix-textdecoder
Feb 24, 2026
Merged

handle whatwg encoding standard overrides#6091
anonrig merged 3 commits intomainfrom
yagiz/fix-textdecoder

Conversation

@anonrig
Copy link
Copy Markdown
Member

@anonrig anonrig commented Feb 17, 2026

Fixes #6038
Fixes #5381
Fixes #5380

The solution became a lot more complicated then I wanted it to be, but unfortunately couldn't find a "more maintainable" version of the solution.

Open to suggestions!


TLDR: ICU implementation for these codec are not unsafe and these codecs not used at all on ICU by Chromium: https://chromium.googlesource.com/chromium/src/third_party/+/refs/heads/main/blink/renderer/platform/wtf/text/text_codec_cjk.cc

@anonrig anonrig requested a review from jasnell February 17, 2026 16:48
@anonrig anonrig requested review from a team as code owners February 17, 2026 16:48
Copy link
Copy Markdown

@ChALkeR ChALkeR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this pass on full wpt ranges now?
If not, this is likely too complicated for a half-fix

@anonrig
Copy link
Copy Markdown
Member Author

anonrig commented Feb 17, 2026

Does this pass on full wpt ranges now?

It should pass now. We have a bug in our test harness that stops us from loading big5 table that WPT expects. Fixing it with this pull-request.

@ChALkeR
Copy link
Copy Markdown

ChALkeR commented Feb 17, 2026

@anonrig most of WPT tests for legacy multibyte are in html form.
You can use this function to run them without complex logic: https://github.com/ExodusOSS/bytes/blob/main/tests/wpt/loader.cjs#L137

It's a stand-alone function basically (remove encode test)

You want to run these tests from WPT:

encoding/legacy-mb-tchinese/big5/big5_chars-csbig5.html
encoding/legacy-mb-tchinese/big5/big5_chars_extra.html
encoding/legacy-mb-tchinese/big5/big5_chars-big5-hkscs.html
encoding/legacy-mb-tchinese/big5/big5_errors.html
encoding/legacy-mb-tchinese/big5/big5_chars-x-x-big5.html
encoding/legacy-mb-tchinese/big5/big5_chars.html
encoding/legacy-mb-tchinese/big5/big5_chars-cn-big5.html
encoding/legacy-mb-japanese/shift_jis/sjis_errors.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-ms932.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-shift-jis.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-csshiftjis.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-sjis.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-ms_kanji.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-windows-31j.html
encoding/legacy-mb-japanese/shift_jis/sjis_chars-x-sjis.html
encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp_errors.html
encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp_chars-csiso2022jp.html
encoding/legacy-mb-japanese/iso-2022-jp/iso2022jp_chars.html
encoding/legacy-mb-japanese/euc-jp/eucjp_errors.html
encoding/legacy-mb-japanese/euc-jp/eucjp_chars-cseucpkdfmtjapanese.html
encoding/legacy-mb-japanese/euc-jp/eucjp_chars-x-euc-jp.html
encoding/legacy-mb-japanese/euc-jp/eucjp_chars.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-csksc56011987.html
encoding/legacy-mb-korean/euc-kr/euckr_chars.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-ks_c_5601-1989.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-korean.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-iso-ir-149.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-windows-949.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-cseuckr.html
encoding/legacy-mb-korean/euc-kr/euckr_errors.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-ksc_5601.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-ks_c_5601-1987.html
encoding/legacy-mb-korean/euc-kr/euckr_chars-ksc5601.html

This won't cover streaming, but would cover character ranges

@codspeed-hq

This comment was marked as off-topic.

@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from 6525e3e to a54fab8 Compare February 17, 2026 18:29
@ChALkeR
Copy link
Copy Markdown

ChALkeR commented Feb 17, 2026

@anonrig I'm unsure if the solution shouldn't be "drop ICU-based impl and copy over the code from @exodus/bytes instead"

I'm going to test that on Node.js shortly

That depends on whether the approach in this PR works (including on streaming) and the perf of ICU here though.

@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from ee14ac5 to c2927c4 Compare February 18, 2026 19:33
@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 18, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from 23e69fb to 12792c2 Compare February 18, 2026 20:37
@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch 3 times, most recently from 272a120 to 3c0ee73 Compare February 19, 2026 22:39
@anonrig anonrig requested a review from jasnell February 19, 2026 22:39
@jasnell
Copy link
Copy Markdown
Collaborator

jasnell commented Feb 20, 2026

C++ and JS parts look good. Looked over the rust parts but feel more comfortable deferring to someone like @mikea on signoff of the rust bits.

@jasnell
Copy link
Copy Markdown
Collaborator

jasnell commented Feb 21, 2026

I'm trying out a PR review agent locally... expect some code review comments from it shortly

Copy link
Copy Markdown
Collaborator

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architectural Review (AI-generated)

Overall this is a solid direction — replacing ICU with encoding_rs for WHATWG-compliant CJK decoding is well-motivated and the CXX bridge follows established patterns in the codebase. The test coverage is thorough and the WPT infrastructure fixes are valuable.

Inline comments below cover the main concerns. The two highest-priority items are: (1) an unconditional performance regression for windows-1252/x-user-defined, and (2) an inconsistent reset contract between decode() and decodePtr() that could cause issues for TextDecoderStream with LegacyDecoder.

@anonrig
Copy link
Copy Markdown
Member Author

anonrig commented Feb 23, 2026

cc @mikea @jasnell can you take a look? this is ready to merge.

Copy link
Copy Markdown
Collaborator

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the C++ bits. Will leave rust bits for @mikea

@anonrig
Copy link
Copy Markdown
Member Author

anonrig commented Feb 23, 2026

/bonk review this change.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Feb 23, 2026

I read all 18 changed files in the PR (Rust bridge code, C++ decoder implementations, shared headers, encoding.c++, encoding.h, test files, compat flag definitions, BUILD files, and the TextDecoderStream usage in streams/encoding.c++) and cross-referenced the review comments to understand which concerns had been addressed, then analyzed the compat flag gating logic, reset/flush contracts between IcuDecoder and LegacyDecoder, and the encoding routing in the constructor to produce a structured review identifying one high-priority issue (ungated windows-1252/x-user-defined path change), two medium issues, and several nits.

github run

Copy link
Copy Markdown
Collaborator

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, went back through the rust bits. I agree with @harrishancock's comments but otherwise LGTM

@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch 2 times, most recently from 91fae46 to ad87983 Compare February 24, 2026 17:54
@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from ad87983 to 0d654ff Compare February 24, 2026 17:55
@anonrig anonrig force-pushed the yagiz/fix-textdecoder branch from 0d654ff to 6460a27 Compare February 24, 2026 19:03
@anonrig anonrig merged commit 5080c86 into main Feb 24, 2026
22 of 24 checks passed
@anonrig anonrig deleted the yagiz/fix-textdecoder branch February 24, 2026 21:49
@ChALkeR
Copy link
Copy Markdown

ChALkeR commented Feb 24, 2026

will test this once a release is published

@ChALkeR
Copy link
Copy Markdown

ChALkeR commented Feb 26, 2026

This fails in the exact place that I warned about in #6091 (comment) / #6091 (comment)

It should not have been discarded

Apparently, you do not have those tests

Other than that, pretty nice!

Upd: opened a new issue: #6193

@ChALkeR
Copy link
Copy Markdown

ChALkeR commented Feb 26, 2026

Overall, nice work migrating to encoding_rs! It's one of the best impls out there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

legacy multi-byte encodings in TextDecoder are incorrect Figure out errors relating to GBK encoding Figure out errors relating to GB18030 decoding

5 participants