Skip to content

fix(client,internals,migrate,generator-helper): handle multibyte UTF-8 characters split across chunk boundaries in byline#28535

Merged
aqrln merged 11 commits intoprisma:mainfrom
onozaty:27695-dmmf-corruption
Dec 11, 2025
Merged

fix(client,internals,migrate,generator-helper): handle multibyte UTF-8 characters split across chunk boundaries in byline#28535
aqrln merged 11 commits intoprisma:mainfrom
onozaty:27695-dmmf-corruption

Conversation

@onozaty
Copy link
Copy Markdown
Contributor

@onozaty onozaty commented Nov 14, 2025

Summary

Fixes #27695

This PR fixes a bug where multibyte UTF-8 characters (e.g., Japanese text, emojis) were corrupted when split across stream chunk boundaries in the byline module used for parsing generator output.

Problem

The original implementation used chunk.toString(encoding) which does not handle incomplete multibyte sequences correctly. When a multibyte character is split across chunk boundaries, the incomplete bytes are replaced with the Unicode replacement character (�), causing corruption of DMMF output containing non-ASCII characters.

Solution

  • Replace chunk.toString(encoding) with Node.js StringDecoder which properly buffers incomplete multibyte sequences
  • Use decoder.write() in _transform to handle partial sequences
  • Use decoder.end() in _flush to finalize any remaining buffered bytes
  • Add node: prefix to built-in module imports for consistency

Changes

Modified packages: generator-helper, client, internals, migrate

Implementation:

  • Fixed _transform method to use StringDecoder.write()
  • Fixed _flush method to append remaining bytes from StringDecoder.end()
  • Added Comprehensive test coverage for multibyte character handling

Test Coverage

Added 5 tests per package (20 total) covering:

  • ✅ Single-byte ASCII characters
  • ✅ Multibyte characters (Japanese, emoji) in complete chunks
  • ✅ Multibyte characters split across chunk boundaries
  • ✅ Truncated incomplete UTF-8 sequences at stream end

All tests pass in affected packages (generator-helper, client, internals, migrate).

Copilot AI review requested due to automatic review settings November 14, 2025 23:17
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Nov 14, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a critical bug where multibyte UTF-8 characters (Japanese text, emojis) were corrupted when split across stream chunk boundaries in the byline module. The fix replaces unsafe chunk.toString(encoding) calls with Node.js's StringDecoder, which properly buffers incomplete multibyte sequences.

Key changes:

  • Replaced Buffer.toString() with StringDecoder for proper multibyte handling
  • Added _flush method logic to finalize any buffered bytes at stream end
  • Added comprehensive test coverage for multibyte character edge cases
  • Updated to use node: prefix for built-in module imports

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
packages/migrate/src/utils/byline.ts Implemented StringDecoder for UTF-8 multibyte character handling
packages/migrate/src/tests/byline.test.ts Added test coverage for single-byte, multibyte, and split boundary scenarios
packages/internals/src/utils/byline.ts Implemented StringDecoder for UTF-8 multibyte character handling
packages/internals/src/tests/byline.test.ts Added test coverage for single-byte, multibyte, and split boundary scenarios
packages/generator-helper/src/byline.ts Implemented StringDecoder for UTF-8 multibyte character handling
packages/generator-helper/src/tests/byline.test.ts Added test coverage for single-byte, multibyte, and split boundary scenarios (using vitest)
packages/client/src/byline.ts Implemented StringDecoder for UTF-8 multibyte character handling
packages/client/src/tests/byline.test.ts Added test coverage for single-byte, multibyte, and split boundary scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

onozaty added a commit to onozaty/prisma that referenced this pull request Nov 15, 2025
…es in byline StringDecoder

- Add _decoderEncoding to track current decoder encoding
- Recreate StringDecoder when encoding changes between chunks
- Add test coverage for encoding change scenarios (ascii to utf8)
- Addresses Copilot review feedback on PR prisma#28535
@onozaty onozaty requested a review from Copilot November 15, 2025 04:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

onozaty added a commit to onozaty/prisma that referenced this pull request Nov 15, 2025
… in byline tests

- Replace it() with test() for consistency across all test files
- Addresses Copilot review feedback on PR prisma#28535
@aqrln
Copy link
Copy Markdown
Member

aqrln commented Nov 17, 2025

Thanks so much for this! Could we also deduplicate this and put it in a single place while we're here? Even better, any chance we could use the built-in node:readline module instead?

@onozaty
Copy link
Copy Markdown
Contributor Author

onozaty commented Nov 17, 2025

Thank you for the feedback about deduplication!

I've replaced the custom byline implementation with Node.js's built-in node:readline module across all packages. This eliminates the duplicate code entirely.

Changes Made

  • Replaced byline with readline.createInterface() in:
    • @prisma/generator-helper (GeneratorProcess, generatorHandler)
    • @prisma/client (BinaryEngine)
    • @prisma/migrate (SchemaEngineCLI)
    • @prisma/internals (removed export)
  • Removed 4 duplicate byline.ts implementations and their tests
  • Removed byline.ts from eslint ignore patterns

Commits:

  • d37f117 - refactor(client,internals,migrate,generator-helper): replace byline with node:readline
  • 84fca2c - chore: remove byline.ts from eslint ignore patterns
  • b5c98f1 - test(generator-helper): add readline multibyte UTF-8 handling tests

Multibyte UTF-8 Character Handling

The original issue this PR addresses (multibyte UTF-8 character corruption) is resolved by using node:readline, which correctly handles multibyte characters split across chunk boundaries.

While it's difficult to write integration tests that reliably reproduce the original corruption issue (we cannot control exactly where child process buffers split), I've added unit tests in packages/generator-helper/src/__tests__/readline-utf8-split.test.ts that verify readline.createInterface() with crlfDelay: Infinity correctly handles multibyte UTF-8 characters split across chunks.

@aqrln aqrln added this to the 7.0.0 milestone Nov 18, 2025
Add tests to verify LineStream behavior with multibyte UTF-8 characters.
Tests confirm that single-byte and multibyte characters work correctly
within single chunks, and expose a bug where multibyte characters split
across chunk boundaries become corrupted.

Test structure:
- Single-byte characters: both patterns pass
- Multibyte characters: single chunk passes, split boundary fails
Use StringDecoder to properly handle incomplete multibyte UTF-8 sequences
that are split across chunk boundaries in streaming data. Without this,
characters like emoji and CJK text would be corrupted when split between
chunks.

Changes:
- Add StringDecoder import and instance to LineStream
- Replace Buffer.toString() with decoder.write() for proper streaming
- Finalize decoder in _flush() to handle remaining buffered bytes
- Assume single encoding per stream (encoding doesn't change mid-stream)

Fixes the test case where Japanese character "あ" split across chunks
would become replacement characters (���) instead of the correct character.
…lush logic

- Add test for truncated multibyte sequences at stream end (tests _flush method)
- Simplify _flush logic by removing unreachable else branch
- Add clarifying comment about split() behavior guarantees
…t across chunk boundaries in byline

- Add StringDecoder to properly handle incomplete multibyte sequences
- Use node: prefix for built-in module imports
- Fix _transform to use decoder.write() instead of toString()
- Fix _flush to append remaining buffered bytes from decoder.end()
- Add comprehensive tests for multibyte character handling
…es in byline StringDecoder

- Add _decoderEncoding to track current decoder encoding
- Recreate StringDecoder when encoding changes between chunks
- Add test coverage for encoding change scenarios (ascii to utf8)
- Addresses Copilot review feedback on PR prisma#28535
… in byline tests

- Replace it() with test() for consistency across all test files
- Addresses Copilot review feedback on PR prisma#28535
…ith node:readline

Replace custom byline implementation with Node.js built-in readline module.

- Remove 4 duplicate byline.ts implementations and tests
- Update BinaryEngine, SchemaEngineCLI, GeneratorProcess to use readline
- Remove byline exports from internals and migrate packages
@onozaty onozaty force-pushed the 27695-dmmf-corruption branch from b5c98f1 to 629b70e Compare November 18, 2025 13:09
- Remove PR-specific context from test comments
- Add reference to original issue prisma#27695 for regression test clarity
@jkomyno jkomyno removed this from the 7.0.0 milestone Nov 19, 2025
@onozaty onozaty requested a review from aqrln November 25, 2025 12:44
Copy link
Copy Markdown
Member

@aqrln aqrln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice, thank you for working on this and sorry for the delay!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 11, 2025
@aqrln aqrln added this to the 7.2.0 milestone Dec 11, 2025
@aqrln aqrln merged commit 6b8702c into prisma:main Dec 11, 2025
180 checks passed
jkomyno pushed a commit that referenced this pull request Dec 16, 2025
…8 characters split across chunk boundaries in byline (#28535)

## Summary

Fixes #27695 

This PR fixes a bug where multibyte UTF-8 characters (e.g., Japanese
text, emojis) were corrupted when split across stream chunk boundaries
in the `byline` module used for parsing generator output.

## Problem

The original implementation used `chunk.toString(encoding)` which does
not handle incomplete multibyte sequences correctly. When a multibyte
character is split across chunk boundaries, the incomplete bytes are
replaced with the Unicode replacement character (�), causing corruption
of DMMF output containing non-ASCII characters.

## Solution

- Replace `chunk.toString(encoding)` with Node.js `StringDecoder` which
properly buffers incomplete multibyte sequences
- Use `decoder.write()` in `_transform` to handle partial sequences
- Use `decoder.end()` in `_flush` to finalize any remaining buffered
bytes
- Add `node:` prefix to built-in module imports for consistency

## Changes

**Modified packages**: `generator-helper`, `client`, `internals`,
`migrate`

**Implementation:**
- **Fixed** `_transform` method to use `StringDecoder.write()`
- **Fixed** `_flush` method to append remaining bytes from
`StringDecoder.end()`
- **Added** Comprehensive test coverage for multibyte character handling

## Test Coverage

Added 5 tests per package (20 total) covering:
- ✅ Single-byte ASCII characters
- ✅ Multibyte characters (Japanese, emoji) in complete chunks
- ✅ Multibyte characters split across chunk boundaries
- ✅ Truncated incomplete UTF-8 sequences at stream end

All tests pass in affected packages (`generator-helper`, `client`,
`internals`, `migrate`).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-6.x-candidate lgtm This PR has been approved by a maintainer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DMMF Generation Causes UTF-8 Character Corruption for Japanese Characters

5 participants