Skip to content

[8.0.0] Reland "Fix most Unicode encoding bugs"#24350

Merged
iancha1992 merged 1 commit intorelease-8.0.0from
wyv-800-encoding
Nov 15, 2024
Merged

[8.0.0] Reland "Fix most Unicode encoding bugs"#24350
iancha1992 merged 1 commit intorelease-8.0.0from
wyv-800-encoding

Conversation

@Wyverald
Copy link
Copy Markdown
Member

Roll forward of a58fe3f: Fix most Unicode encoding bugs.

NEW: Relative to the original CL, the only changes are in latin1_jni_path.cc and NativePosixFiles.java. In latin1_jni_path.cc, the former implementation of GetStringLatin1Chars with a fallback for strings with a UTF-16 coder is restored, with a BugReport sent whenever any such string is detected. In NativePosixFiles.java, a private method is added to be called from JNI to send the BugReport.

*** Original change description ***

Automated rollback of commit a58fe3f.

*** Reason for rollback ***

Causing crashes for internal software that uses Bazel's VFS stuff.

*** Original change description ***

Fix most Unicode encoding bugs

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes Strings as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the String contents, depending on the OS and availability of a Latin-1 locale.


PiperOrigin-RevId: 696524066
Change-Id: Ifdddacc08c1a81ad719b1aeac2a93882cbafbcd2

NEW: Relative to the original CL, the only changes are in latin1_jni_path.cc and NativePosixFiles.java. In latin1_jni_path.cc, the former implementation of GetStringLatin1Chars with a fallback for strings with a UTF-16 coder is restored, with a BugReport sent whenever any such string is detected. In NativePosixFiles.java, a private method is added to be called from JNI to send the BugReport.

*** Original change description ***

Automated rollback of commit a58fe3f.

*** Reason for rollback ***

Causing crashes for internal software that uses Bazel's VFS stuff.

*** Original change description ***

Fix most Unicode encoding bugs

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes Strings as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the String contents, depending on the OS and availability of a Latin-1 locale.

***

PiperOrigin-RevId: 696524066
Change-Id: Ifdddacc08c1a81ad719b1aeac2a93882cbafbcd2
@Wyverald Wyverald requested a review from a team as a code owner November 15, 2024 20:40
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Rules-Java Issues for Java rules team-Rules-CPP Issues for C++ rules team-Local-Exec Issues and PRs for the Execution (Local) team team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Nov 15, 2024
@iancha1992 iancha1992 added this pull request to the merge queue Nov 15, 2024
@iancha1992 iancha1992 added this to the 8.0.0 release blockers milestone Nov 15, 2024
Merged via the queue into release-8.0.0 with commit 2733cff Nov 15, 2024
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Nov 15, 2024
@Wyverald Wyverald deleted the wyv-800-encoding branch November 16, 2024 00:02
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request Feb 8, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were
written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disable failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists in detecting
whether the second byte is `0`, where the default UTF-8 decoding would
fail, in which case we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows to enable `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests in Windows CI.
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request Feb 10, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were
written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disable failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists in detecting
whether the second byte is `0`, where the default UTF-8 decoding would
fail, in which case we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows to enable `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests in Windows CI.
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request Feb 10, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were
written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disable failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists in detecting
whether the second byte is `0`, where the default UTF-8 decoding would
fail, in which case we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows to enable `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests in Windows CI.
tonyaiuto added a commit to bazelbuild/rules_pkg that referenced this pull request Feb 19, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disable failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists in detecting
whether the second byte is `0`, where the default UTF-8 decoding would
fail, in which case we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows to enable `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests in Windows CI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Local-Exec Issues and PRs for the Execution (Local) team team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team team-Rules-CPP Issues for C++ rules team-Rules-Java Issues for Java rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants