[8.0.0] Reland "Fix most Unicode encoding bugs" by Wyverald · Pull Request #24350 · bazelbuild/bazel

Wyverald · 2024-11-15T20:40:38Z

Roll forward of a58fe3f: Fix most Unicode encoding bugs.

NEW: Relative to the original CL, the only changes are in latin1_jni_path.cc and NativePosixFiles.java. In latin1_jni_path.cc, the former implementation of GetStringLatin1Chars with a fallback for strings with a UTF-16 coder is restored, with a BugReport sent whenever any such string is detected. In NativePosixFiles.java, a private method is added to be called from JNI to send the BugReport.

*** Original change description ***

Automated rollback of commit a58fe3f.

*** Reason for rollback ***

Causing crashes for internal software that uses Bazel's VFS stuff.

*** Original change description ***

Fix most Unicode encoding bugs

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes Strings as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the String contents, depending on the OS and availability of a Latin-1 locale.

PiperOrigin-RevId: 696524066
Change-Id: Ifdddacc08c1a81ad719b1aeac2a93882cbafbcd2

NEW: Relative to the original CL, the only changes are in latin1_jni_path.cc and NativePosixFiles.java. In latin1_jni_path.cc, the former implementation of GetStringLatin1Chars with a fallback for strings with a UTF-16 coder is restored, with a BugReport sent whenever any such string is detected. In NativePosixFiles.java, a private method is added to be called from JNI to send the BugReport. *** Original change description *** Automated rollback of commit a58fe3f. *** Reason for rollback *** Causing crashes for internal software that uses Bazel's VFS stuff. *** Original change description *** Fix most Unicode encoding bugs Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes Strings as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the String contents, depending on the OS and availability of a Latin-1 locale. *** PiperOrigin-RevId: 696524066 Change-Id: Ifdddacc08c1a81ad719b1aeac2a93882cbafbcd2

Prior to Bazel 8, manifest files containing non-ASCII characters were written with UTF-16LE encoding instead of UTF-8 on Windows: - bazelbuild/bazel#24231 - bazelbuild/bazel#24350 - bazelbuild/bazel#24403 This led to disable failing tests in CI: `//tests/zip:unicode_test`: ``` File "pkg\private\manifest.py", line 59, in read_entries_from raw_entries = json.loads(fh.read()) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte ``` `//tests/mappings:utf8_manifest_test`: ``` File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch got = json.loads(g_fp.read()) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte ``` Since the manifest is plain JSON, the fix simply consists in detecting whether the second byte is `0`, where the default UTF-8 decoding would fail, in which case we assume the file is UTF-16LE-encoded. The code is slightly reorganized to factor out the encoding selection. This allows to enable `//tests/mappings:utf8_manifest_test` and `//tests/zip:unicode_test` tests in Windows CI.

Wyverald requested a review from a team as a code owner November 15, 2024 20:40

iancha1992 approved these changes Nov 15, 2024

View reviewed changes

iancha1992 added this pull request to the merge queue Nov 15, 2024

iancha1992 added this to the 8.0.0 release blockers milestone Nov 15, 2024

Merged via the queue into release-8.0.0 with commit 2733cff Nov 15, 2024

github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Nov 15, 2024

Wyverald deleted the wyv-800-encoding branch November 16, 2024 00:02

Wyverald mentioned this pull request Nov 16, 2024

[8.0.0] Reland Unicode bug fixes #24277

Closed

rdesgroppes mentioned this pull request Feb 8, 2026

fix: handle UTF-16LE-encoded manifests on Windows bazelbuild/rules_pkg#1021

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[8.0.0] Reland "Fix most Unicode encoding bugs"#24350

[8.0.0] Reland "Fix most Unicode encoding bugs"#24350
iancha1992 merged 1 commit intorelease-8.0.0from
wyv-800-encoding

Wyverald commented Nov 15, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Wyverald commented Nov 15, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants