fix(client,internals,migrate,generator-helper): handle multibyte UTF-8 characters split across chunk boundaries in byline#28535
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a critical bug where multibyte UTF-8 characters (Japanese text, emojis) were corrupted when split across stream chunk boundaries in the byline module. The fix replaces unsafe chunk.toString(encoding) calls with Node.js's StringDecoder, which properly buffers incomplete multibyte sequences.
Key changes:
- Replaced
Buffer.toString()withStringDecoderfor proper multibyte handling - Added
_flushmethod logic to finalize any buffered bytes at stream end - Added comprehensive test coverage for multibyte character edge cases
- Updated to use
node:prefix for built-in module imports
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/migrate/src/utils/byline.ts | Implemented StringDecoder for UTF-8 multibyte character handling |
| packages/migrate/src/tests/byline.test.ts | Added test coverage for single-byte, multibyte, and split boundary scenarios |
| packages/internals/src/utils/byline.ts | Implemented StringDecoder for UTF-8 multibyte character handling |
| packages/internals/src/tests/byline.test.ts | Added test coverage for single-byte, multibyte, and split boundary scenarios |
| packages/generator-helper/src/byline.ts | Implemented StringDecoder for UTF-8 multibyte character handling |
| packages/generator-helper/src/tests/byline.test.ts | Added test coverage for single-byte, multibyte, and split boundary scenarios (using vitest) |
| packages/client/src/byline.ts | Implemented StringDecoder for UTF-8 multibyte character handling |
| packages/client/src/tests/byline.test.ts | Added test coverage for single-byte, multibyte, and split boundary scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…es in byline StringDecoder - Add _decoderEncoding to track current decoder encoding - Recreate StringDecoder when encoding changes between chunks - Add test coverage for encoding change scenarios (ascii to utf8) - Addresses Copilot review feedback on PR prisma#28535
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… in byline tests - Replace it() with test() for consistency across all test files - Addresses Copilot review feedback on PR prisma#28535
|
Thanks so much for this! Could we also deduplicate this and put it in a single place while we're here? Even better, any chance we could use the built-in |
|
Thank you for the feedback about deduplication! I've replaced the custom byline implementation with Node.js's built-in Changes Made
Commits:
Multibyte UTF-8 Character HandlingThe original issue this PR addresses (multibyte UTF-8 character corruption) is resolved by using While it's difficult to write integration tests that reliably reproduce the original corruption issue (we cannot control exactly where child process buffers split), I've added unit tests in |
packages/client/src/runtime/core/engines/binary/BinaryEngine.ts
Outdated
Show resolved
Hide resolved
packages/generator-helper/src/__tests__/readline-utf8-split.test.ts
Outdated
Show resolved
Hide resolved
Add tests to verify LineStream behavior with multibyte UTF-8 characters. Tests confirm that single-byte and multibyte characters work correctly within single chunks, and expose a bug where multibyte characters split across chunk boundaries become corrupted. Test structure: - Single-byte characters: both patterns pass - Multibyte characters: single chunk passes, split boundary fails
Use StringDecoder to properly handle incomplete multibyte UTF-8 sequences that are split across chunk boundaries in streaming data. Without this, characters like emoji and CJK text would be corrupted when split between chunks. Changes: - Add StringDecoder import and instance to LineStream - Replace Buffer.toString() with decoder.write() for proper streaming - Finalize decoder in _flush() to handle remaining buffered bytes - Assume single encoding per stream (encoding doesn't change mid-stream) Fixes the test case where Japanese character "あ" split across chunks would become replacement characters (���) instead of the correct character.
…lush logic - Add test for truncated multibyte sequences at stream end (tests _flush method) - Simplify _flush logic by removing unreachable else branch - Add clarifying comment about split() behavior guarantees
…t across chunk boundaries in byline - Add StringDecoder to properly handle incomplete multibyte sequences - Use node: prefix for built-in module imports - Fix _transform to use decoder.write() instead of toString() - Fix _flush to append remaining buffered bytes from decoder.end() - Add comprehensive tests for multibyte character handling
…es in byline StringDecoder - Add _decoderEncoding to track current decoder encoding - Recreate StringDecoder when encoding changes between chunks - Add test coverage for encoding change scenarios (ascii to utf8) - Addresses Copilot review feedback on PR prisma#28535
… in byline tests - Replace it() with test() for consistency across all test files - Addresses Copilot review feedback on PR prisma#28535
…ith node:readline Replace custom byline implementation with Node.js built-in readline module. - Remove 4 duplicate byline.ts implementations and tests - Update BinaryEngine, SchemaEngineCLI, GeneratorProcess to use readline - Remove byline exports from internals and migrate packages
b5c98f1 to
629b70e
Compare
- Remove PR-specific context from test comments - Add reference to original issue prisma#27695 for regression test clarity
aqrln
left a comment
There was a problem hiding this comment.
This is very nice, thank you for working on this and sorry for the delay!
…8 characters split across chunk boundaries in byline (#28535) ## Summary Fixes #27695 This PR fixes a bug where multibyte UTF-8 characters (e.g., Japanese text, emojis) were corrupted when split across stream chunk boundaries in the `byline` module used for parsing generator output. ## Problem The original implementation used `chunk.toString(encoding)` which does not handle incomplete multibyte sequences correctly. When a multibyte character is split across chunk boundaries, the incomplete bytes are replaced with the Unicode replacement character (�), causing corruption of DMMF output containing non-ASCII characters. ## Solution - Replace `chunk.toString(encoding)` with Node.js `StringDecoder` which properly buffers incomplete multibyte sequences - Use `decoder.write()` in `_transform` to handle partial sequences - Use `decoder.end()` in `_flush` to finalize any remaining buffered bytes - Add `node:` prefix to built-in module imports for consistency ## Changes **Modified packages**: `generator-helper`, `client`, `internals`, `migrate` **Implementation:** - **Fixed** `_transform` method to use `StringDecoder.write()` - **Fixed** `_flush` method to append remaining bytes from `StringDecoder.end()` - **Added** Comprehensive test coverage for multibyte character handling ## Test Coverage Added 5 tests per package (20 total) covering: - ✅ Single-byte ASCII characters - ✅ Multibyte characters (Japanese, emoji) in complete chunks - ✅ Multibyte characters split across chunk boundaries - ✅ Truncated incomplete UTF-8 sequences at stream end All tests pass in affected packages (`generator-helper`, `client`, `internals`, `migrate`).
Summary
Fixes #27695
This PR fixes a bug where multibyte UTF-8 characters (e.g., Japanese text, emojis) were corrupted when split across stream chunk boundaries in the
bylinemodule used for parsing generator output.Problem
The original implementation used
chunk.toString(encoding)which does not handle incomplete multibyte sequences correctly. When a multibyte character is split across chunk boundaries, the incomplete bytes are replaced with the Unicode replacement character (�), causing corruption of DMMF output containing non-ASCII characters.Solution
chunk.toString(encoding)with Node.jsStringDecoderwhich properly buffers incomplete multibyte sequencesdecoder.write()in_transformto handle partial sequencesdecoder.end()in_flushto finalize any remaining buffered bytesnode:prefix to built-in module imports for consistencyChanges
Modified packages:
generator-helper,client,internals,migrateImplementation:
_transformmethod to useStringDecoder.write()_flushmethod to append remaining bytes fromStringDecoder.end()Test Coverage
Added 5 tests per package (20 total) covering:
All tests pass in affected packages (
generator-helper,client,internals,migrate).