Repo rules fail to extract unicode archives due to latin-1 hack #12986
Description of the problem / feature request:
rules_go currently fails on some machines because the Go source archive contains filenames with Unicode characters, specifically the character Ä. For Linux and macOS, Go archives are distributed as tar.gz files whose tar entries carry pax headers.
Affected systems include:
- ZFS volumes with the utf8only option. This is Ubuntu's default when choosing ZFS at install time.
- macOS (HFS+ and APFS require UTF-8 filenames)
Bazel uses Apache Commons Compress to extract tar archives. For most tar files, Commons Compress defers to the encoding specified by the JVM's -Dfile.encoding param, or the platform default. With ISO-8859-1 - Bazel's preference - UTF-8 encoded filename bytes in tar files basically pass through verbatim when extracted and everything works.
But when the tar entry has a pax path header, the path name is always decoded as UTF-8.
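To see where the pax path header comes from, here is a minimal sketch using Python's standard tarfile module rather than Commons Compress or Bazel. It writes the same filename once in ustar format and once in pax format, then inspects the result. The helper names build_tar and pax_headers_of are made up for this illustration.

```python
import io
import tarfile

NAME = "\u00c4foo.txt"  # "Äfoo.txt" with the composed form U+00C4
PAYLOAD = b"boo!\n"


def build_tar(fmt):
    """Return the raw bytes of an uncompressed tar holding one file named NAME."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt, encoding="utf-8") as tar:
        info = tarfile.TarInfo(NAME)
        info.size = len(PAYLOAD)
        tar.addfile(info, io.BytesIO(PAYLOAD))
    return buf.getvalue()


def pax_headers_of(raw):
    """Return the pax extended-header records attached to the first entry."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r", encoding="utf-8") as tar:
        return dict(tar.getmembers()[0].pax_headers)


ustar_raw = build_tar(tarfile.USTAR_FORMAT)
pax_raw = build_tar(tarfile.PAX_FORMAT)

# ustar: the name sits in the fixed header as raw bytes, with no pax record.
print(pax_headers_of(ustar_raw))
# pax: the non-ASCII name is moved into a "path" record encoded as UTF-8.
print(pax_headers_of(pax_raw))
print(b"path=\xc3\x84foo.txt" in pax_raw)
```

The ustar header records no encoding for the name bytes, so a reader has to guess or be configured. The pax path record is UTF-8, which is why an extractor decodes it as UTF-8 regardless of the configured default.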
The character Ä in its composed form is Unicode code point U+00C4. In Java's internal UTF-16 it is a single char with the value 0x00C4. Encoded as UTF-8 it becomes the two-byte sequence 0xC3 0x84, since a lone 0xC4 byte is not valid UTF-8.
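The byte values above can be checked with a few lines of Python. This only illustrates the encodings themselves and involves no Bazel or Java code.

```python
composed = "\u00c4"  # Ä, Latin Capital Letter A with Diaeresis

utf16_units = composed.encode("utf-16-be")  # one 16-bit unit: 0x00C4
utf8_bytes = composed.encode("utf-8")       # two bytes: 0xC3 0x84

print(utf16_units.hex(), utf8_bytes.hex())

# A lone 0xC4 byte is the start of a two-byte sequence with nothing after it,
# so a strict UTF-8 decoder rejects it.
try:
    b"\xc4".decode("utf-8")
    lone_c4_is_valid = True
except UnicodeDecodeError:
    lone_c4_is_valid = False

print(lone_c4_is_valid)
```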
When Commons Compress parses a pax-formatted tar file with a filename containing Ä as the 0xC3 0x84 UTF-8 string, the resulting Java string contains the value 0xC4 after decoding. But this value is never re-encoded as UTF-8 when creating the file on the filesystem. Instead, Bazel uses the Java char values verbatim as long as they're < 0xff. An attempt to create a filename containing 0xC4 on a filesystem that requires UTF-8 filenames will fail.
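The two extraction paths can be modelled roughly as follows. This is a sketch of the behaviour described above, not Bazel's or Commons Compress's actual code: the function names are hypothetical, and "write each char as one byte" is approximated with Python's latin-1 codec.

```python
def decode_entry_name(raw_name: bytes, from_pax_header: bool) -> str:
    """Model of the decode step: pax paths are UTF-8, everything else ISO-8859-1."""
    return raw_name.decode("utf-8" if from_pax_header else "latin-1")


def bytes_written_to_disk(name: str) -> bytes:
    """Model of the write step: each char below 0x100 becomes a single byte."""
    return name.encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = "\u00c4foo.go".encode("utf-8")  # bytes as stored in the archive

non_pax = bytes_written_to_disk(decode_entry_name(raw, from_pax_header=False))
pax = bytes_written_to_disk(decode_entry_name(raw, from_pax_header=True))

# Non-pax: the UTF-8 bytes pass through unchanged and the filesystem accepts them.
print(non_pax == raw, is_valid_utf8(non_pax))
# Pax: the name collapses to a lone 0xC4 byte, which a UTF-8-only filesystem rejects.
print(pax, is_valid_utf8(pax))
```

In this model the Latin-1 round trip is only lossless when the decode step also used Latin-1. Once the pax path has been decoded as UTF-8, writing the chars verbatim no longer reproduces the original bytes.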
But rules_go doesn't currently fail on macOS systems, even though they also require UTF-8 filenames. This is because the darwin archives use decomposed representations of Unicode characters. OS X has a history of preferring the decomposed forms over composed.
So instead of Ä being U+00C4 ("Latin Capital Letter A with Diaeresis"), it's U+0041 (just capital A) followed by U+0308 ("Combining Diaeresis"). Encoded in UTF-8 as seen in the macOS golang tarballs, the byte string is 0x41 0xCC 0x88. Decoded to a Java string (16-bit chars) it's 0x0041 0x0308. Coincidentally, I presume, Bazel is able to extract this decomposed form on UTF-8 filesystems because it drops the diaeresis and replaces it with a literal '?' character. So instead of Äfoo.go as contained in the Go source archive, Bazel writes A?foo.go on macOS.
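The decomposed case can be reproduced in Python as a rough analogy. It assumes that Bazel's handling of chars above 0xFF behaves like encoding to Latin-1 with '?' substitution, which matches the A?foo.go result described above but is not taken from Bazel's source.

```python
import unicodedata

composed = "\u00c4foo.go"
decomposed = unicodedata.normalize("NFD", composed)  # "A" + U+0308 + "foo.go"

code_points = [hex(ord(ch)) for ch in decomposed[:2]]
utf8_prefix = decomposed.encode("utf-8")[:3]

# U+0308 has no Latin-1 representation, so it is replaced with '?'.
# The result is pure ASCII and therefore also valid UTF-8.
written = decomposed.encode("latin-1", errors="replace")

print(code_points, utf8_prefix.hex(), written)
```

The extraction succeeds only because the substituted name happens to be ASCII. The file on disk no longer has the name recorded in the archive.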
As on Linux with ZFS, Bazel fails to extract the linux archive on macOS, as reported here.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
test.bzl
def _tar_round_trip_impl(ctx):
    ctx.file("Äfoo.txt", "boo!\n")
    ctx.execute(["tar", "--format=" + ctx.attr.format, "-czvf", "file.tar.gz", "Äfoo.txt"])
    ctx.extract("file.tar.gz", "out")
    ctx.file("BUILD.bazel", 'exports_files(["Äfoo.txt", "out/Äfoo.txt", "file.tar.gz"])', legacy_utf8=False)

tar_round_trip = repository_rule(
    implementation = _tar_round_trip_impl,
    attrs = {
        "format": attr.string(
            mandatory = True,
        ),
    },
)

WORKSPACE
load("//:test.bzl", "tar_round_trip")
tar_round_trip(
    name = "non_pax",
    format = "ustar",  # supported by both macos BSD tar and linux GNU tar
)

tar_round_trip(
    name = "pax",
    format = "pax",
)

BUILD
genrule(
    name = "non_pax_test",
    srcs = ["@non_pax//:out/Äfoo.txt"],
    outs = ["non_pax.txt"],
    cmd = """
        cp $(location @non_pax//:out/Äfoo.txt) "$@"
    """,
)

genrule(
    name = "pax_test",
    srcs = ["@pax//:out/Äfoo.txt"],
    outs = ["pax.txt"],
    cmd = """
        cp $(location @pax//:out/Äfoo.txt) "$@"
    """,
)

# Works
bazel build //:non_pax_test
# Fails, either due to not being able to write the file (utf8 filesystem),
# or because the written filename is mangled.
bazel build //:pax_test

What operating system are you running Bazel on?
Mac OS 10.15.7
Ubuntu 20.04 with a ZFS root partition with utf8only enabled (the default for Ubuntu's ZFS support).
What's the output of bazel info release?
release 4.0.0
Have you found anything relevant by searching the web?
bazel-contrib/rules_go#2771
#374 - pretty generic issue regarding filename characters.
#7055 - an issue with the same problematic file in the Go archive, but targeted at Darwin only.