Repo rules fail to extract unicode archives due to latin-1 hack #12986

@jvolkman

Description of the problem / feature request:

rules_go currently fails on some machines because the Go source archive contains filenames with non-ASCII Unicode characters, specifically the character Ä. For Linux and macOS, Go archives are distributed as tar.gz files whose tar entries carry pax headers.

Affected systems include:

  • ZFS volumes with the utf8only option. This is Ubuntu's default when choosing ZFS at install time.
  • macOS (HFS+ and APFS require UTF-8 filenames)
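One rough way to check whether a given directory sits on a filesystem of this kind is to try creating a file whose name is not valid UTF-8. The sketch below is only a heuristic: the function name accepts_non_utf8_names is hypothetical, and any OSError (including unrelated ones such as permission problems) is treated as a rejection.

```python
import os
import tempfile


def accepts_non_utf8_names(directory):
    """Return True if a file named with the lone byte 0xC4 can be created.

    Heuristic only: every OSError is interpreted as "name rejected".
    """
    # Build the path as bytes so Python passes the raw 0xC4 byte to the OS.
    path = os.path.join(os.fsencode(directory), b"\xc4foo.txt")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except OSError:
        return False
    os.close(fd)
    os.unlink(path)
    return True


with tempfile.TemporaryDirectory() as scratch:
    print(accepts_non_utf8_names(scratch))
```

A result of False suggests the filesystem enforces UTF-8 names, as described for ZFS with utf8only and for macOS above.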

Bazel uses Apache Commons Compress to extract tar archives. For most tar files, Commons Compress decodes entry names with the encoding specified by the JVM's -Dfile.encoding parameter, or the platform default. With ISO-8859-1, Bazel's preference, every byte maps to exactly one char, so UTF-8 encoded filename bytes in tar files pass through verbatim when extracted and everything works.

But when the tar entry has a pax path header, the path name is always decoded as UTF-8.
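The two decoding paths can be illustrated with Python's tarfile module. It is used here only as a stand-in for Commons Compress, which is an assumption made for illustration, and the helper names make_archive and read_names are hypothetical. Like the behaviour described above, tarfile decodes classic ustar names with the caller's encoding but decodes pax path records as UTF-8.

```python
import io
import tarfile


def make_archive(fmt):
    """Build an in-memory tar holding one file named U+00C4 + 'foo.txt'."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt, encoding="utf-8") as tar:
        data = b"boo!\n"
        info = tarfile.TarInfo(name="\u00c4foo.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_names(archive, encoding):
    """List member names, decoding non-pax headers with `encoding`."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r", encoding=encoding) as tar:
        return [member.name for member in tar.getmembers()]


ustar_names = read_names(make_archive(tarfile.USTAR_FORMAT), "iso-8859-1")
pax_names = read_names(make_archive(tarfile.PAX_FORMAT), "iso-8859-1")

# ustar: the two UTF-8 bytes come back as two separate chars (0xC3, 0x84).
print(ascii(ustar_names[0]))  # '\xc3\x84foo.txt'
# pax: the path record is decoded as UTF-8, giving the single char U+00C4.
print(ascii(pax_names[0]))    # '\xc4foo.txt'
```

Even though both archives are read with ISO-8859-1 as the requested encoding, only the ustar name is affected by that choice.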

The character Ä in its composed form is Unicode code point U+00C4. In Java's internal UTF-16 it is a single char with the value 0x00C4. Encoded as UTF-8 it becomes the two-byte sequence 0xC3 0x84, because the byte 0xC4 on its own is not a valid UTF-8 sequence.
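These encodings are easy to confirm. The short Python sketch below uses UTF-16 big-endian output as a stand-in for the value of a Java char; the variable names are illustrative only.

```python
name = "\u00c4"  # composed form of the letter A with diaeresis

utf16_units = name.encode("utf-16-be")  # one 16-bit code unit
utf8_bytes = name.encode("utf-8")       # two bytes

print(utf16_units.hex())  # 00c4
print(utf8_bytes.hex())   # c384

# 0xC4 is a UTF-8 lead byte; without a continuation byte it cannot be decoded.
try:
    b"\xc4".decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False

print(lone_byte_is_valid)  # False
```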

When Commons Compress parses a pax-formatted tar file with a filename containing Ä as the 0xC3 0x84 UTF-8 string, the resulting Java string contains the value 0xC4 after decoding. But this value is never re-encoded as UTF-8 when creating the file on the filesystem. Instead, Bazel uses the Java char values verbatim as long as they're < 0xff. An attempt to create a filename containing 0xC4 on a filesystem that requires UTF-8 filenames will fail.
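The failure can be modelled at the byte level. This is a simplified sketch, not Bazel's actual code: it assumes that "using the char values verbatim" behaves like encoding the string as ISO-8859-1, and the helper is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(raw):
    """Return True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes of the filename as stored in the archive (UTF-8).
stored = "\u00c4foo.go".encode("utf-8")

# Non-pax entry: decoded as ISO-8859-1, then written back one byte per char.
non_pax_written = stored.decode("iso-8859-1").encode("iso-8859-1")

# Pax entry: decoded as UTF-8, then written back one byte per char.
pax_written = stored.decode("utf-8").encode("iso-8859-1")

print(non_pax_written, is_valid_utf8(non_pax_written))  # b'\xc3\x84foo.go' True
print(pax_written, is_valid_utf8(pax_written))          # b'\xc4foo.go' False
```

The non-pax path round-trips the original bytes, while the pax path produces a name that a UTF-8-only filesystem rejects.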

However, rules_go doesn't currently fail when extracting the darwin archives on macOS, even though macOS also requires UTF-8 filenames. This is because the darwin archives use decomposed representations of Unicode characters. OS X has a history of preferring decomposed forms over composed ones.

So instead of Ä being U+00C4 ("Latin Capital Letter A with Diaeresis"), it's U+0041 (just capital A) followed by U+0308 ("Combining Diaeresis"). Encoded in UTF-8 as seen in the macOS golang tarballs, the byte string is 0x41 0xCC 0x88. Decoded to a Java string (16-bit chars) it's 0x0041 0x0308. Coincidentally I presume, Bazel is able to extract this decomposed form on UTF-8 filesystems because it ignores the diaeresis and replaces it with a literal '?' character. So instead of Äfoo.go as is contained in the Go source archive, Bazel writes A?foo.go on macOS.
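The decomposed form and the resulting mangled name can be reproduced in a few lines of Python. The last step is an assumption: it models the '?' substitution with an ISO-8859-1 encode that replaces unrepresentable characters, which matches the observed A?foo.go but is not taken from Bazel's source.

```python
import unicodedata

composed = "\u00c4foo.go"
decomposed = unicodedata.normalize("NFD", composed)  # A + combining diaeresis

print(ascii(decomposed))           # 'A\u0308foo.go'
print(decomposed.encode("utf-8"))  # b'A\xcc\x88foo.go'

# Model of the write step: U+0308 does not fit in one byte, so it becomes '?'.
mangled = decomposed.encode("iso-8859-1", errors="replace")
print(mangled)                     # b'A?foo.go'
```

Because the mangled name is pure ASCII, it is also valid UTF-8, which is why the write succeeds even though the name is wrong.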

As with Linux on ZFS, Bazel fails to extract the linux archive on macOS, as reported here.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

test.bzl

def _tar_round_trip_impl(ctx):
    # Create a file whose name contains a non-ASCII character.
    ctx.file("Äfoo.txt", "boo!\n")
    # Archive it with the system tar in the requested format (ustar or pax).
    ctx.execute(["tar", "--format=" + ctx.attr.format, "-czvf", "file.tar.gz", "Äfoo.txt"])
    # Extract the archive again using Bazel's own tar support.
    ctx.extract("file.tar.gz", "out")
    ctx.file("BUILD.bazel", 'exports_files(["Äfoo.txt", "out/Äfoo.txt", "file.tar.gz"])', legacy_utf8=False)

tar_round_trip = repository_rule(
    implementation = _tar_round_trip_impl,
    attrs = {
        "format": attr.string(
            mandatory = True,
        ),
    },
)

WORKSPACE

load("//:test.bzl", "tar_round_trip")

tar_round_trip(
    name = "non_pax",
    format = "ustar",  # supported by both macos BSD tar and linux GNU tar
)

tar_round_trip(
    name = "pax",
    format = "pax",
)

BUILD

genrule(
    name = "non_pax_test",
    srcs = ["@non_pax//:out/Äfoo.txt"],
    outs = ["non_pax.txt"],
    cmd = """
        cp $(location @non_pax//:out/Äfoo.txt) "$@"
    """,
)

genrule(
    name = "pax_test",
    srcs = ["@pax//:out/Äfoo.txt"],
    outs = ["pax.txt"],
    cmd = """
        cp $(location @pax//:out/Äfoo.txt) "$@"
    """,
)

# Works
bazel build //:non_pax_test

# Fails, either due to not being able to write the file (utf8 filesystem),
# or because the written filename is mangled.
bazel build //:pax_test

What operating system are you running Bazel on?

Mac OS 10.15.7
Ubuntu 20.04 with a ZFS root partition with utf8only enabled (the default for Ubuntu's ZFS support).

What's the output of bazel info release?

release 4.0.0

Have you found anything relevant by searching the web?

bazel-contrib/rules_go#2771
#374 - pretty generic issue regarding filename characters.
#7055 - an issue with the same problematic file in the Go archive, but targeted at Darwin only.

Metadata

Assignees

No one assigned

    Labels

    P4: This is either out of scope or we don't have bandwidth to review a PR. (No assignee)
    help wanted: Someone outside the Bazel team could own this
    platform: apple
    team-Starlark-Integration: Issues involving Bazel's integration with Starlark, excluding builtin symbols
    type: bug
