fix(client,internals,migrate,generator-helper): handle multibyte UTF-8 characters split across chunk boundaries in byline by onozaty · Pull Request #28535 · prisma/prisma

onozaty · 2025-11-14T23:17:50Z

Summary

This PR fixes a bug where multibyte UTF-8 characters (e.g., Japanese text, emojis) were corrupted when split across stream chunk boundaries in the byline module used for parsing generator output.

Problem

The original implementation used chunk.toString(encoding), which does not handle incomplete multibyte sequences correctly. When a multibyte character is split across chunk boundaries, the incomplete bytes are replaced with the Unicode replacement character (�), corrupting DMMF output that contains non-ASCII characters.
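The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not code from the PR: the variable names are illustrative, and the split point is chosen by hand to land inside the three-byte encoding of "あ".

```typescript
import { StringDecoder } from 'node:string_decoder'

// "あ" is encoded as three bytes in UTF-8 (E3 81 82).
const bytes = Buffer.from('あ', 'utf8')

// Simulate a chunk boundary that falls inside the character.
const firstChunk = bytes.subarray(0, 1)
const secondChunk = bytes.subarray(1)

// Decoding each chunk on its own cannot reassemble the character:
// every incomplete piece is turned into U+FFFD.
const naive = firstChunk.toString('utf8') + secondChunk.toString('utf8')

// StringDecoder keeps the incomplete prefix buffered until the rest arrives.
const decoder = new StringDecoder('utf8')
const buffered = decoder.write(firstChunk) + decoder.write(secondChunk) + decoder.end()

console.log(naive === 'あ') // false
console.log(buffered === 'あ') // true
```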

Solution

- Replace chunk.toString(encoding) with Node.js StringDecoder, which properly buffers incomplete multibyte sequences
- Use decoder.write() in _transform to handle partial sequences
- Use decoder.end() in _flush to finalize any remaining buffered bytes
- Add node: prefix to built-in module imports for consistency
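The steps above can be sketched as a small Transform stream. This is an illustration only, assuming a single UTF-8 encoding per stream; `LineSplitter` and its `pending` field are hypothetical names, and the actual byline implementation in the repository differs in detail.

```typescript
import { Transform, type TransformCallback } from 'node:stream'
import { StringDecoder } from 'node:string_decoder'

// Hypothetical, simplified line splitter (not the real byline LineStream).
class LineSplitter extends Transform {
  private readonly decoder = new StringDecoder('utf8')
  // Text after the last newline seen so far; it may still grow into a full line.
  private pending = ''

  constructor() {
    // Emit one string per line on the readable side.
    super({ readableObjectMode: true })
  }

  _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback): void {
    // decoder.write() returns only complete characters and keeps a trailing
    // partial multibyte sequence buffered for the next chunk.
    const lines = (this.pending + this.decoder.write(chunk)).split(/\r?\n/)
    this.pending = lines.pop() ?? ''
    for (const line of lines) {
      // Empty lines are skipped in this sketch.
      if (line.length > 0) this.push(line)
    }
    callback()
  }

  _flush(callback: TransformCallback): void {
    // decoder.end() returns whatever is still buffered at end of stream.
    const rest = this.pending + this.decoder.end()
    if (rest.length > 0) this.push(rest)
    callback()
  }
}
```

In normal use a byte stream would be piped into an instance and lines consumed through `data` events or async iteration.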

Changes

Modified packages: generator-helper, client, internals, migrate

Implementation:

- Fixed _transform method to use StringDecoder.write()
- Fixed _flush method to append remaining bytes from StringDecoder.end()
- Added comprehensive test coverage for multibyte character handling

Test Coverage

Added 5 tests per package (20 total) covering:

- ✅ Single-byte ASCII characters
- ✅ Multibyte characters (Japanese, emoji) in complete chunks
- ✅ Multibyte characters split across chunk boundaries
- ✅ Truncated incomplete UTF-8 sequences at stream end

All tests pass in affected packages (generator-helper, client, internals, migrate).
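The last case in the list, a stream that ends in the middle of a character, is the one that exercises the flush path. A minimal sketch of what StringDecoder does there, using hand-picked bytes rather than anything taken from the PR's tests:

```typescript
import { StringDecoder } from 'node:string_decoder'

const truncatedDecoder = new StringDecoder('utf8')

// "a" followed by only the first two bytes of "あ" (E3 81 82).
const head = truncatedDecoder.write(Buffer.from([0x61, 0xe3, 0x81]))

// At end of stream the incomplete sequence can never be completed,
// so end() emits a substitution character for it instead of dropping it silently.
const tail = truncatedDecoder.end()

console.log(head === 'a') // true
console.log(tail.includes('\uFFFD')) // true
```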

CLAassistant · 2025-11-14T23:17:57Z

All committers have signed the CLA.

Copilot

Pull Request Overview

This PR fixes a critical bug where multibyte UTF-8 characters (Japanese text, emojis) were corrupted when split across stream chunk boundaries in the byline module. The fix replaces unsafe chunk.toString(encoding) calls with Node.js's StringDecoder, which properly buffers incomplete multibyte sequences.

Key changes:

- Replaced Buffer.toString() with StringDecoder for proper multibyte handling
- Added _flush method logic to finalize any buffered bytes at stream end
- Added comprehensive test coverage for multibyte character edge cases
- Updated to use node: prefix for built-in module imports

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| packages/migrate/src/utils/byline.ts | Implemented StringDecoder for UTF-8 multibyte character handling |
| packages/migrate/src/__tests__/byline.test.ts | Added test coverage for single-byte, multibyte, and split boundary scenarios |
| packages/internals/src/utils/byline.ts | Implemented StringDecoder for UTF-8 multibyte character handling |
| packages/internals/src/__tests__/byline.test.ts | Added test coverage for single-byte, multibyte, and split boundary scenarios |
| packages/generator-helper/src/byline.ts | Implemented StringDecoder for UTF-8 multibyte character handling |
| packages/generator-helper/src/__tests__/byline.test.ts | Added test coverage for single-byte, multibyte, and split boundary scenarios (using vitest) |
| packages/client/src/byline.ts | Implemented StringDecoder for UTF-8 multibyte character handling |
| packages/client/src/__tests__/byline.test.ts | Added test coverage for single-byte, multibyte, and split boundary scenarios |

packages/migrate/src/utils/byline.ts

packages/internals/src/utils/byline.ts

packages/generator-helper/src/byline.ts

packages/client/src/byline.ts

…es in byline StringDecoder
- Add _decoderEncoding to track current decoder encoding
- Recreate StringDecoder when encoding changes between chunks
- Add test coverage for encoding change scenarios (ascii to utf8)
- Addresses Copilot review feedback on PR prisma#28535
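The encoding-tracking idea in this commit can be sketched as follows. This is an assumption-laden illustration, not the committed code: `ChunkDecoder`, `decode`, and `decoderEncoding` are hypothetical names, and flushing the old decoder before replacing it is a design choice made for this sketch.

```typescript
import { StringDecoder } from 'node:string_decoder'

// Hypothetical helper that keeps one StringDecoder per active encoding.
class ChunkDecoder {
  private decoder?: StringDecoder
  private decoderEncoding?: BufferEncoding

  decode(chunk: Buffer, encoding: BufferEncoding): string {
    let decoder = this.decoder
    let carried = ''
    if (decoder === undefined || this.decoderEncoding !== encoding) {
      // Flush whatever the previous decoder still holds, then start over
      // with a decoder for the new encoding.
      carried = decoder?.end() ?? ''
      decoder = new StringDecoder(encoding)
      this.decoder = decoder
      this.decoderEncoding = encoding
    }
    return carried + decoder.write(chunk)
  }
}
```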

Copilot

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

packages/migrate/src/__tests__/byline.test.ts

packages/generator-helper/src/__tests__/byline.test.ts

packages/migrate/src/utils/byline.ts

packages/internals/src/utils/byline.ts

packages/generator-helper/src/byline.ts

packages/client/src/byline.ts

… in byline tests
- Replace it() with test() for consistency across all test files
- Addresses Copilot review feedback on PR prisma#28535

aqrln · 2025-11-17T11:35:17Z

Thanks so much for this! Could we also deduplicate this and put it in a single place while we're here? Even better, any chance we could use the built-in node:readline module instead?

onozaty · 2025-11-17T15:07:03Z

Thank you for the feedback about deduplication!

I've replaced the custom byline implementation with Node.js's built-in node:readline module across all packages. This eliminates the duplicate code entirely.

Changes Made

- Replaced byline with readline.createInterface() in:
  - @prisma/generator-helper (GeneratorProcess, generatorHandler)
  - @prisma/client (BinaryEngine)
  - @prisma/migrate (SchemaEngineCLI)
  - @prisma/internals (removed export)
- Removed 4 duplicate byline.ts implementations and their tests
- Removed byline.ts from eslint ignore patterns

Commits:

- d37f117 - refactor(client,internals,migrate,generator-helper): replace byline with node:readline
- 84fca2c - chore: remove byline.ts from eslint ignore patterns
- b5c98f1 - test(generator-helper): add readline multibyte UTF-8 handling tests

Multibyte UTF-8 Character Handling

The original issue this PR addresses (multibyte UTF-8 character corruption) is resolved by using node:readline, which correctly handles multibyte characters split across chunk boundaries.

While it's difficult to write integration tests that reliably reproduce the original corruption issue (we cannot control exactly where child process buffers split), I've added unit tests in packages/generator-helper/src/__tests__/readline-utf8-split.test.ts that verify readline.createInterface() with crlfDelay: Infinity correctly handles multibyte UTF-8 characters split across chunks.
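A unit test of that kind could be built on a helper like the one below. This is a sketch under assumptions, not the contents of readline-utf8-split.test.ts: `readLines` is a hypothetical helper, and a PassThrough stream stands in for the child process output so the chunk boundaries can be chosen explicitly.

```typescript
import { createInterface } from 'node:readline'
import { PassThrough } from 'node:stream'

// Hypothetical helper: feeds the given chunks into readline one by one
// and resolves with the lines it emits.
async function readLines(chunks: Buffer[]): Promise<string[]> {
  const input = new PassThrough()
  const rl = createInterface({ input, crlfDelay: Infinity })
  const lines: string[] = []
  rl.on('line', (line) => lines.push(line))
  const closed = new Promise<void>((resolve) => rl.once('close', resolve))

  for (const chunk of chunks) input.write(chunk)
  input.end()

  await closed
  return lines
}

// Split "あ\n" after the first of its three UTF-8 bytes.
const payload = Buffer.from('あ\n', 'utf8')
readLines([payload.subarray(0, 1), payload.subarray(1)]).then((lines) => console.log(lines))
```

The helper writes each chunk separately, so a boundary placed inside a character reaches readline as two incomplete pieces.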

packages/client/src/runtime/core/engines/binary/BinaryEngine.ts

packages/generator-helper/src/__tests__/readline-utf8-split.test.ts

Add tests to verify LineStream behavior with multibyte UTF-8 characters. Tests confirm that single-byte and multibyte characters work correctly within single chunks, and expose a bug where multibyte characters split across chunk boundaries become corrupted.

Test structure:
- Single-byte characters: both patterns pass
- Multibyte characters: single chunk passes, split boundary fails

Use StringDecoder to properly handle incomplete multibyte UTF-8 sequences that are split across chunk boundaries in streaming data. Without this, characters like emoji and CJK text would be corrupted when split between chunks.

Changes:
- Add StringDecoder import and instance to LineStream
- Replace Buffer.toString() with decoder.write() for proper streaming
- Finalize decoder in _flush() to handle remaining buffered bytes
- Assume single encoding per stream (encoding doesn't change mid-stream)

Fixes the test case where Japanese character "あ" split across chunks would become replacement characters (��) instead of the correct character.

…lush logic
- Add test for truncated multibyte sequences at stream end (tests _flush method)
- Simplify _flush logic by removing unreachable else branch
- Add clarifying comment about split() behavior guarantees

…rts in byline

…t across chunk boundaries in byline
- Add StringDecoder to properly handle incomplete multibyte sequences
- Use node: prefix for built-in module imports
- Fix _transform to use decoder.write() instead of toString()
- Fix _flush to append remaining buffered bytes from decoder.end()
- Add comprehensive tests for multibyte character handling

…ith node:readline

Replace custom byline implementation with Node.js built-in readline module.
- Remove 4 duplicate byline.ts implementations and tests
- Update BinaryEngine, SchemaEngineCLI, GeneratorProcess to use readline
- Remove byline exports from internals and migrate packages

- Remove PR-specific context from test comments
- Add reference to original issue prisma#27695 for regression test clarity

aqrln

This is very nice, thank you for working on this and sorry for the delay!

…8 characters split across chunk boundaries in byline (#28535)

Fixes #27695

Copilot AI review requested due to automatic review settings November 14, 2025 23:17

Copilot started reviewing on behalf of onozaty November 14, 2025 23:18

Copilot finished reviewing on behalf of onozaty November 14, 2025 23:20

Copilot AI reviewed Nov 14, 2025

packages/migrate/src/utils/byline.ts Outdated

packages/internals/src/utils/byline.ts Outdated

packages/generator-helper/src/byline.ts Outdated

packages/client/src/byline.ts Outdated

onozaty requested a review from Copilot November 15, 2025 04:56

Copilot started reviewing on behalf of onozaty November 15, 2025 04:56

Copilot finished reviewing on behalf of onozaty November 15, 2025 04:58

Copilot AI reviewed Nov 15, 2025

aqrln reviewed Nov 18, 2025

packages/client/src/runtime/core/engines/binary/BinaryEngine.ts Outdated

aqrln reviewed Nov 18, 2025

packages/generator-helper/src/__tests__/readline-utf8-split.test.ts Outdated

aqrln reviewed Nov 18, 2025

packages/generator-helper/src/__tests__/readline-utf8-split.test.ts

aqrln added this to the 7.0.0 milestone Nov 18, 2025

onozaty added 10 commits November 18, 2025 21:53

test(generator-helper): improve byline multibyte tests and simplify f…

7d12225

…lush logic
- Add test for truncated multibyte sequences at stream end (tests _flush method)
- Simplify _flush logic by removing unreachable else branch
- Add clarifying comment about split() behavior guarantees

refactor(generator-helper): use node: prefix for built-in module impo…

0cf312d

…rts in byline

test(client,internals,migrate): use consistent test() function naming…

b3a3715

… in byline tests
- Replace it() with test() for consistency across all test files
- Addresses Copilot review feedback on PR prisma#28535

chore: remove byline.ts from eslint ignore patterns

aeb2f8d

test(generator-helper): add readline multibyte UTF-8 handling tests

629b70e

onozaty force-pushed the 27695-dmmf-corruption branch from b5c98f1 to 629b70e Compare November 18, 2025 13:09

docs(generator-helper): improve test documentation for UTF-8 handling

32f73f6

- Remove PR-specific context from test comments
- Add reference to original issue prisma#27695 for regression test clarity

jkomyno removed this from the 7.0.0 milestone Nov 19, 2025

onozaty requested a review from aqrln November 25, 2025 12:44

aqrln approved these changes Dec 11, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 11, 2025

aqrln added this to the 7.2.0 milestone Dec 11, 2025

aqrln merged commit 6b8702c into prisma:main Dec 11, 2025
180 checks passed

aqrln added the backport-6.x-candidate label Dec 11, 2025

5 participants
