[8.0.0] Reland "Fix most Unicode encoding bugs" #24350
Merged
iancha1992 merged 1 commit into release-8.0.0 on Nov 15, 2024
Conversation
NEW: Relative to the original CL, the only changes are in latin1_jni_path.cc and NativePosixFiles.java. In latin1_jni_path.cc, the former implementation of GetStringLatin1Chars with a fallback for strings with a UTF-16 coder is restored, with a BugReport sent whenever any such string is detected. In NativePosixFiles.java, a private method is added to be called from JNI to send the BugReport.
*** Original change description ***
Automated rollback of commit a58fe3f.
*** Reason for rollback ***
Causing crashes for internal software that uses Bazel's VFS stuff.
*** Original change description ***
Fix most Unicode encoding bugs
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes Strings as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the String contents, depending on the OS and availability of a Latin-1 locale.
PiperOrigin-RevId: 696524066
Change-Id: Ifdddacc08c1a81ad719b1aeac2a93882cbafbcd2
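The "raw bytes with a Latin-1 coder" idea in the description can be illustrated with a small sketch. This is not Bazel's code (Bazel is written in Java and C++); the function names below are hypothetical and only show why a Latin-1 view of raw bytes round-trips losslessly and why a reencoding step is needed before handing such a string to an encoding-aware API. It assumes the underlying path bytes happen to be UTF-8, which is not guaranteed in general.

```python
def to_internal(raw: bytes) -> str:
    """Hypothetical helper: view raw path bytes as a string, one char per byte.

    Latin-1 maps every byte value 0..255 to the code point with the same
    number, so this never fails and never loses information.
    """
    return raw.decode("latin-1")


def from_internal(internal: str) -> bytes:
    """Hypothetical helper: recover the exact original bytes."""
    return internal.encode("latin-1")


def reencode_for_unicode_api(internal: str) -> str:
    """Hypothetical helper: turn the internal form into a real Unicode string.

    Assumes the raw bytes are valid UTF-8; raises UnicodeDecodeError otherwise.
    """
    return from_internal(internal).decode("utf-8")


raw_path = "ä.txt".encode("utf-8")   # the bytes a UTF-8 file system would hold
internal = to_internal(raw_path)     # "ä" becomes two Latin-1 characters
restored = reencode_for_unicode_api(internal)
```

The internal form is one character longer than the displayed name because the two UTF-8 bytes of "ä" each become a separate Latin-1 character, which is why passing it unchanged to an encoding-aware API would produce mojibake.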
iancha1992 approved these changes on Nov 15, 2024
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request on Feb 8, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disabling the failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
    raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
    got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists of checking whether the second byte is `0`, a case in which the default UTF-8 decoding would fail; if so, the file is assumed to be UTF-16LE-encoded. The code is slightly reorganized to factor out the encoding selection.

This allows enabling the `//tests/mappings:utf8_manifest_test` and `//tests/zip:unicode_test` tests in Windows CI.
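The second-byte check described in this commit message could look roughly like the sketch below. It is not the actual rules_pkg code: `detect_manifest_encoding` and `load_manifest` are hypothetical names, and the sketch assumes the manifest has no byte order mark and starts with an ASCII character such as `[` or `{`, which in UTF-16LE is followed by a zero byte.

```python
import json


def detect_manifest_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding of a JSON manifest.

    JSON text starts with an ASCII character. In UTF-16LE that character is
    stored as its ASCII byte followed by 0x00, so a zero second byte suggests
    UTF-16LE. Otherwise fall back to the default, UTF-8.
    """
    if len(data) >= 2 and data[1] == 0:
        return "utf-16-le"
    return "utf-8"


def load_manifest(data: bytes):
    """Hypothetical helper: decode with the detected encoding, then parse."""
    return json.loads(data.decode(detect_manifest_encoding(data)))


# Illustrative manifest entry; the field names are made up for this example.
entries = [{"dest": "héllo.txt"}]
text = json.dumps(entries, ensure_ascii=False)

from_utf8 = load_manifest(text.encode("utf-8"))
from_utf16 = load_manifest(text.encode("utf-16-le"))
```

Keeping the encoding selection in its own function mirrors the commit message's note about factoring it out, so that every reader of the manifest can share the same check.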
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request on Feb 10, 2026
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request on Feb 10, 2026
tonyaiuto added a commit to bazelbuild/rules_pkg that referenced this pull request on Feb 19, 2026
Roll forward of a58fe3f: Fix most Unicode encoding bugs.