Tree artifact up-to-dateness check can be very slow for large tree artifacts #17009

@lberki

Description of the bug:

When checking whether a local action cache entry is up-to-date, Bazel takes a long time on actions that have large tree artifacts among their inputs. The stack trace while Bazel is working on this is:

   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(java.base@11.0.6/Native Method)
        at java.io.FileInputStream.read(java.base@11.0.6/Unknown Source)
        at com.google.common.io.ByteStreams.copy(ByteStreams.java:114)
        at com.google.common.io.ByteSource.copyTo(ByteSource.java:257)
        at com.google.common.io.ByteSource.hash(ByteSource.java:340)
        at com.google.devtools.build.lib.vfs.FileSystem.getDigest(FileSystem.java:339)
        at com.google.devtools.build.lib.unix.UnixFileSystem.getDigest(UnixFileSystem.java:452)
        at com.google.devtools.build.lib.vfs.Path.getDigest(Path.java:690)
        at com.google.devtools.build.lib.vfs.DigestUtils.manuallyComputeDigest(DigestUtils.java:194)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructFileArtifactValue(ActionMetadataHandler.java:564)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructFileArtifactValueFromFilesystem(ActionMetadataHandler.java:496)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.lambda$constructTreeArtifactValueFromFilesystem$0(ActionMetadataHandler.java:354)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler$$Lambda$1121/0x0000000800857040.visit(Unknown Source)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:411)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:414)
        at com.google.devtools.build.lib.skyframe.TreeArtifactValue.visitTree(TreeArtifactValue.java:393)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.constructTreeArtifactValueFromFilesystem(ActionMetadataHandler.java:342)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.getTreeArtifactValue(ActionMetadataHandler.java:317)
        at com.google.devtools.build.lib.skyframe.ActionMetadataHandler.getMetadata(ActionMetadataHandler.java:265)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.getMetadataOrConstant(ActionCacheChecker.java:566)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.getMetadataMaybe(ActionCacheChecker.java:579)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.validateArtifacts(ActionCacheChecker.java:207)
        at com.google.devtools.build.lib.actions.ActionCacheChecker.mustExecute(ActionCacheChecker.java:541)

My theory is that this is slow because the tree is visited, and every file in it digested, on a single thread in TreeArtifactValue.visitTree() when called from ActionMetadataHandler.constructTreeArtifactValueFromFilesystem().
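To illustrate the theory outside of Bazel, here is a minimal shell sketch. It is not Bazel's implementation, just a stand-in: it builds a small throwaway directory tree (the `TREE` directory, its 4 x 4 x 8 layout, and the choice of `sha256sum` are all assumptions made for the example), digests every file with a single process, then digests the same files with up to four processes at once. The per-file digests do not depend on each other, so both runs should produce the same set of digests.

```shell
#!/bin/bash
# Sketch only: sequential vs. concurrent digesting of a directory tree.
set -eu

TREE="$(mktemp -d)"   # hypothetical stand-in for a tree artifact

# Build a small nested tree: 4 x 4 directories with 8 files of 64 KiB each.
for i in 1 2 3 4; do
  for j in 1 2 3 4; do
    mkdir -p "$TREE/$i/$j"
    for k in 1 2 3 4 5 6 7 8; do
      head -c 65536 /dev/urandom > "$TREE/$i/$j/$k"
    done
  done
done

# Sequential: one sha256sum process works through the whole file list.
seq_digests="$(find "$TREE" -type f -print0 | xargs -0 sha256sum | sort)"

# Concurrent: up to 4 sha256sum processes, 8 files per invocation.
par_digests="$(find "$TREE" -type f -print0 | xargs -0 -n 8 -P 4 sha256sum | sort)"

file_count="$(printf '%s\n' "$seq_digests" | wc -l)"
echo "hashed $file_count files"

rm -rf "$TREE"
```

Wrapping each of the two pipelines in `time` on a much larger tree is one way to see how much of the wall time is spent digesting, and how much of it a concurrent traversal could plausibly recover.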

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

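Run the following shell commands. They create a WORKSPACE, a BUILD file, a Starlark rule that produces a tree artifact, and a generator script, then build the target twice with a server shutdown in between: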

touch WORKSPACE
mkdir -p r
cat > r/BUILD <<'EOF'
load(":r.bzl", "r")

r(name = "ta")

genrule(
    name = "c",
    srcs = [":ta"],
    outs = ["co"],
    cmd = "find $(location :ta) > $@",
)

sh_binary(
    name = "gen",
    srcs = ["gen.sh"],
)
EOF

cat > r/r.bzl << 'EOF'
def _r_impl(ctx):
    ta = ctx.actions.declare_directory("d")
    ctx.actions.run(
        outputs = [ta],
        inputs = [],
        executable = ctx.executable._gen,
        arguments = [ta.path],
    )
    return [DefaultInfo(files = depset([ta]))]

r = rule(
    implementation = _r_impl,
    attrs = {
        "_gen": attr.label(default = "//r:gen", executable = True, cfg = "exec"),
    },
)
EOF

cat > r/gen.sh <<'EOF'
#!/bin/bash

OUT="$1"
mkdir -p "$OUT"

for i in $(seq 1 10); do
  for j in $(seq 1 10); do
    for k in $(seq 1 100); do
      mkdir -p "$OUT/$i/$j"
      #echo "$i $j $k" > "$OUT/$i/$j/$k"
      dd if=/dev/random of="$OUT/$i/$j/$k" bs=1024 count=1024
    done
  done
done
echo hello > "$OUT/hello"
EOF

chmod +x r/gen.sh

bazel build //r:c
bazel shutdown
bazel build //r:c  # This is slow
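
For a sense of scale, the size of the tree artifact follows from the loop bounds in gen.sh. The sketch below just does that arithmetic, assuming every `dd` invocation writes its full 1024 x 1024 bytes (a short read from the random device would make a file smaller):

```shell
# Sizes implied by the loops in gen.sh (assumes no short reads in dd).
files=$(( 10 * 10 * 100 + 1 ))                  # 10,000 random files plus "hello"
bytes=$(( 10 * 10 * 100 * 1024 * 1024 + 6 ))    # 1 MiB each, plus the 6 bytes of "hello\n"
mib=$(( bytes / 1024 / 1024 ))
echo "$files files, about $mib MiB"
```

So the second build presumably has to re-read and digest roughly 10,000 MiB across 10,001 files, which is consistent with the stack trace above sitting in FileInputStream.readBytes().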

Which operating system are you running Bazel on?

Linux @ Google

What is the output of bazel info release?

development version

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

From git commit de4746d.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

Labels: P3 (We're not considering working on this, but happy to review a PR; no assignee), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug
