handle whatwg encoding standard overrides#6091
Conversation
ChALkeR
left a comment
There was a problem hiding this comment.
Does this pass on full wpt ranges now?
If not, this is likely too complicated for a half-fix
It should pass now. We have a bug in our test harness that stops us from loading big5 table that WPT expects. Fixing it with this pull-request. |
|
@anonrig most of WPT tests for legacy multibyte are in html form. It's a stand-alone function basically (remove You want to run these tests from WPT: This won't cover streaming, but would cover character ranges |
This comment was marked as off-topic.
This comment was marked as off-topic.
6525e3e to
a54fab8
Compare
|
@anonrig I'm unsure if the solution shouldn't be "drop ICU-based impl and copy over the code from @exodus/bytes instead" I'm going to test that on Node.js shortly That depends on whether the approach in this PR works (including on streaming) and the perf of ICU here though. |
ee14ac5 to
c2927c4
Compare
|
The generated output of |
23e69fb to
12792c2
Compare
272a120 to
3c0ee73
Compare
|
C++ and JS parts look good. Looked over the rust parts but feel more comfortable deferring to someone like @mikea on signoff of the rust bits. |
|
I'm trying out a PR review agent locally... expect some code review comments from it shortly |
jasnell
left a comment
There was a problem hiding this comment.
Architectural Review (AI-generated)
Overall this is a solid direction — replacing ICU with encoding_rs for WHATWG-compliant CJK decoding is well-motivated and the CXX bridge follows established patterns in the codebase. The test coverage is thorough and the WPT infrastructure fixes are valuable.
Inline comments below cover the main concerns. The two highest-priority items are: (1) an unconditional performance regression for windows-1252/x-user-defined, and (2) an inconsistent reset contract between decode() and decodePtr() that could cause issues for TextDecoderStream with LegacyDecoder.
0cc8d2d to
0734371
Compare
0734371 to
b1efc7a
Compare
|
/bonk review this change. |
|
I read all 18 changed files in the PR (Rust bridge code, C++ decoder implementations, shared headers, encoding.c++, encoding.h, test files, compat flag definitions, BUILD files, and the TextDecoderStream usage in streams/encoding.c++) and cross-referenced the review comments to understand which concerns had been addressed, then analyzed the compat flag gating logic, reset/flush contracts between IcuDecoder and LegacyDecoder, and the encoding routing in the constructor to produce a structured review identifying one high-priority issue (ungated windows-1252/x-user-defined path change), two medium issues, and several nits. |
jasnell
left a comment
There was a problem hiding this comment.
Ok, went back through the rust bits. I agree with @harrishancock's comments but otherwise LGTM
91fae46 to
ad87983
Compare
ad87983 to
0d654ff
Compare
0d654ff to
6460a27
Compare
|
will test this once a release is published |
|
This fails in the exact place that I warned about in #6091 (comment) / #6091 (comment) It should not have been discarded Apparently, you do not have those tests Other than that, pretty nice! Upd: opened a new issue: #6193 |
|
Overall, nice work migrating to |
Fixes #6038
Fixes #5381
Fixes #5380
The solution became a lot more complicated then I wanted it to be, but unfortunately couldn't find a "more maintainable" version of the solution.
Open to suggestions!
TLDR: ICU implementation for these codec are not unsafe and these codecs not used at all on ICU by Chromium: https://chromium.googlesource.com/chromium/src/third_party/+/refs/heads/main/blink/renderer/platform/wtf/text/text_codec_cjk.cc