Skip to content

fix TextDecoder streaming to ignore empty chunks#6212

Merged
anonrig merged 1 commit intomainfrom
yagiz/fix-text-decoder-empty
Feb 27, 2026
Merged

fix TextDecoder streaming to ignore empty chunks#6212
anonrig merged 1 commit intomainfrom
yagiz/fix-text-decoder-empty

Conversation

@anonrig
Copy link
Copy Markdown
Member

@anonrig anonrig commented Feb 27, 2026

If I understood correctly, this should fix #6193

Fixes #6193

cc @ChALkeR

@anonrig anonrig requested a review from jasnell February 27, 2026 21:35
@anonrig anonrig requested review from a team as code owners February 27, 2026 21:35
Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks correct — short-circuiting empty non-flush chunks is the right workaround for the encoding_rs bug. One suggestion to strengthen the test coverage.

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Feb 27, 2026

I'm Bonk, and I've done a quick review of your PR.

Summary: Workaround for an encoding_rs bug where empty chunks during streaming mode incorrectly flush decoder state, breaking multi-byte character decoding for legacy CJK encodings.

Findings:

  1. Low — Test coverage only covers one of the affected encodings. The issue reports failures for big5, euc-kr, and shift_jis, but the test only validates shift_jis. I posted a suggestion adding big5 and euc-kr test cases to match the reported scope.

The C++ fix itself is correct and minimal — returning an empty string for empty non-flush buffers avoids tickling the encoding_rs bug while preserving flush semantics (where an empty buffer still needs to finalize pending state).

github run

@anonrig anonrig enabled auto-merge February 27, 2026 21:41
@anonrig anonrig merged commit c8cd8f1 into main Feb 27, 2026
29 checks passed
@anonrig anonrig deleted the yagiz/fix-text-decoder-empty branch February 27, 2026 22:16
@ChALkeR
Copy link
Copy Markdown

ChALkeR commented Mar 4, 2026

@anonrig I re-tested workerd, it now passes my streaming and snapshot tests!

Nice

It also passes the heavy tests
"mistakes" test was only an extract of that listed cases that tripped some engines
The actual one is pseudo-random / heavy

@ChALkeR
Copy link
Copy Markdown

ChALkeR commented Mar 4, 2026

... it somehow manages to be at least 1.5x slower than my js impl though, on workerd
but that's something that i will investigate separately

JS:

chalker@macbook-air bytes % ./node_modules/.bin/exodus-test --engine workerd:bundle tests/encoding/multi-byte.test.js
Engine: workerd:bundle, running on ./node_modules/.pnpm/workerd@1.20260301.1/node_modules/workerd/bin/workerd
Warning: workerd:bundle engine support is incomplete
# tests/encoding/multi-byte.test.js
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > gbk
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > gb18030
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > big5
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > euc-jp
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > iso-2022-jp
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > shift_jis
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > euc-kr
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > gbk
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > gb18030
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > big5
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > euc-jp
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > iso-2022-jp
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > shift_jis
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > euc-kr
workerd/server/server.c++:5928: debug: [ TEST ] main
workerd/server/server.c++:5942: debug: [ PASS ] main (9.142784s)
All 1 test suites passed
Total time: 10.027s

Native:

chalker@macbook-air bytes % ./node_modules/.bin/exodus-test --engine workerd:bundle tests/encoding/multi-byte.test.js
Engine: workerd:bundle, running on ./node_modules/.pnpm/workerd@1.20260301.1/node_modules/workerd/bin/workerd
Warning: workerd:bundle engine support is incomplete
# tests/encoding/multi-byte.test.js
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > gbk
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > gb18030
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > big5
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > euc-jp
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > iso-2022-jp
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > shift_jis
✔ PASS legacy multi-byte encodings snapshot tests > Fresh instance > euc-kr
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > gbk
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > gb18030
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > big5
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > euc-jp
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > iso-2022-jp
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > shift_jis
✔ PASS legacy multi-byte encodings snapshot tests > Reuse instance > euc-kr
workerd/server/server.c++:5928: debug: [ TEST ] main
workerd/server/server.c++:5942: debug: [ PASS ] main (14.529759s)
All 1 test suites passed
Total time: 15.559s

That includes the whole setup/snapshot logic etc, so it's actually more than 1.5x

@ChALkeR
Copy link
Copy Markdown

ChALkeR commented Mar 4, 2026

Performance is likely not that significant for legacy encodings though
What's most important is that the results are now consistent with the spec!
This is very good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

legacy multi-byte encodings in TextDecoder are wrong in streaming

3 participants