Skip to content

[8.0.0] Fix more Unicode bugs#24403

Merged
meteorcloudy merged 2 commits intobazelbuild:release-8.0.0from
fmeum:24242-cherry
Nov 20, 2024
Merged

[8.0.0] Fix more Unicode bugs#24403
meteorcloudy merged 2 commits intobazelbuild:release-8.0.0from
fmeum:24242-cherry

Conversation

@fmeum
Copy link
Copy Markdown
Collaborator

@fmeum fmeum commented Nov 19, 2024

  • Use Latin-1 in many native file write rules for consistency with the internal encoding.
  • Use Latin-1 for the resolved repository file and the JSON profile.
  • Fix unused_input_list handling of non-ASCII characters in file names.
  • Flip the legacy_utf8 parameter of repository_ctx.file to False and make it a no-op. With the previous default, any non-ASCII characters would be written out as double encoded UTF-8, which is not a useful choice.
  • Change repository_ctx.template to operate on raw bytes for consistency with repository_ctx.read and to fix substitution with non-ASCII keys/values.
  • Move some usages of UTF_8 closer to their usage site to clarify why they are correct.
  • Fixes parsing of dependency files with Unicode character contents (/showIncludes and .d files)

Closes #24182.

PiperOrigin-RevId: 698111811
Change-Id: Ie43bab9eb5963bf81690dd8985d358f544a711c9 (cherry picked from commit 3fdec93)

Fixes #24242

* Use Latin-1 in many native file write rules for consistency with the internal encoding.
* Use Latin-1 for the resolved repository file and the JSON profile.
* Fix `unused_input_list` handling of non-ASCII characters in file names.
* Flip the `legacy_utf8` parameter of `repository_ctx.file` to `False` and make it a no-op. With the previous default, any non-ASCII characters would be written out as double encoded UTF-8, which is not a useful choice.
* Change `repository_ctx.template` to operate on raw bytes for consistency with `repository_ctx.read` and to fix substitution with non-ASCII keys/values.
* Move some usages of `UTF_8` closer to their usage site to clarify why they are correct.
* Fixes parsing of dependency files with Unicode character contents (`/showIncludes` and `.d` files)

Closes bazelbuild#24182.

PiperOrigin-RevId: 698111811
Change-Id: Ie43bab9eb5963bf81690dd8985d358f544a711c9
(cherry picked from commit 3fdec93)
@fmeum fmeum requested a review from a team as a code owner November 19, 2024 23:20
@fmeum fmeum requested review from Wyverald and removed request for a team November 19, 2024 23:20
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Rules-CPP Issues for C++ rules awaiting-review PR is awaiting review from an assigned reviewer labels Nov 19, 2024
@fmeum fmeum requested a review from a team November 19, 2024 23:20
@fmeum fmeum requested a review from tjgq November 20, 2024 11:38
@meteorcloudy meteorcloudy added this pull request to the merge queue Nov 20, 2024
Merged via the queue into bazelbuild:release-8.0.0 with commit ed38e47 Nov 20, 2024
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Nov 20, 2024
@fmeum fmeum deleted the 24242-cherry branch November 20, 2024 12:40
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request Feb 8, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were
written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disable failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists in detecting
whether the second byte is `0`, where the default UTF-8 decoding would
fail, in which case we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows to enable `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests in Windows CI.
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request Feb 10, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were
written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disable failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists in detecting
whether the second byte is `0`, where the default UTF-8 decoding would
fail, in which case we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows to enable `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests in Windows CI.
rdesgroppes added a commit to rdesgroppes/rules_pkg that referenced this pull request Feb 10, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were
written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disable failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists in detecting
whether the second byte is `0`, where the default UTF-8 decoding would
fail, in which case we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows to enable `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests in Windows CI.
tonyaiuto added a commit to bazelbuild/rules_pkg that referenced this pull request Feb 19, 2026
Prior to Bazel 8, manifest files containing non-ASCII characters were written with UTF-16LE encoding instead of UTF-8 on Windows:
- bazelbuild/bazel#24231
- bazelbuild/bazel#24350
- bazelbuild/bazel#24403

This led to disable failing tests in CI:

`//tests/zip:unicode_test`:
```
File "pkg\private\manifest.py", line 59, in read_entries_from
  raw_entries = json.loads(fh.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 338: invalid start byte
```

`//tests/mappings:utf8_manifest_test`:
```
File "tests\mappings\manifest_test_lib.py", line 39, in assertManifestsMatch
  got = json.loads(g_fp.read())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 354: invalid start byte
```

Since the manifest is plain JSON, the fix simply consists in detecting
whether the second byte is `0`, where the default UTF-8 decoding would
fail, in which case we assume the file is UTF-16LE-encoded.

The code is slightly reorganized to factor out the encoding selection.

This allows to enable `//tests/mappings:utf8_manifest_test` and
`//tests/zip:unicode_test` tests in Windows CI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Performance Issues for Performance teams team-Rules-CPP Issues for C++ rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants