Checking --disk_cache cache is slow for huge tree artifacts #17804
Description
Description of the bug:
When an action misses the local action cache and one of its inputs is a huge tree artifact (hundreds of thousands of files and gigabytes of data), checking whether the action is in --disk_cache takes a long time. CPU use is pegged at 100% in these cases.
The CPU time seems to be spent in MerkleTree.build(ActionInput). Its implementation seems to be single-threaded so it's not a big surprise that it's slow.
According to the report I received, this happens after a BUILD file is changed. That is consistent with Skyframe losing its in-memory metadata cache, which makes it necessary to reconstruct the metadata from the data on disk.
I can imagine further clever optimizations, but doing the Merkle tree computation in parallel would be the first step. Doesn't look too complicated, although managing another thread pool would be annoying and we can't quite wait until Loom comes online.
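To make the "compute the Merkle tree in parallel" idea concrete, here is a minimal sketch that uses a dedicated ForkJoinPool and one RecursiveTask per subdirectory. It is not Bazel's MerkleTree.build and does not use Bazel's digest format: the class ParallelMerkleSketch, the in-memory Dir type, DigestTask, and the byte encoding of entries are all hypothetical names and choices made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;

public class ParallelMerkleSketch {
  /** Hypothetical in-memory stand-in for a directory of a tree artifact. */
  static final class Dir {
    final TreeMap<String, byte[]> files = new TreeMap<>(); // name -> content
    final TreeMap<String, Dir> dirs = new TreeMap<>();     // name -> subdirectory
  }

  static MessageDigest newSha256() {
    try {
      return MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Appends one entry as kind + name + NUL + fixed-length child digest. */
  static void addEntry(MessageDigest md, char kind, String name, byte[] digest) {
    md.update((byte) kind);
    md.update(name.getBytes(StandardCharsets.UTF_8));
    md.update((byte) 0);
    md.update(digest);
  }

  /** Computes the digest of one directory; subdirectories run as parallel subtasks. */
  static final class DigestTask extends RecursiveTask<byte[]> {
    private final Dir dir;

    DigestTask(Dir dir) {
      this.dir = dir;
    }

    @Override
    protected byte[] compute() {
      List<DigestTask> subtasks = new ArrayList<>();
      for (Dir child : dir.dirs.values()) {
        subtasks.add(new DigestTask(child));
      }
      // Runs all subdirectory tasks, letting idle pool workers steal them.
      ForkJoinTask.invokeAll(subtasks);

      MessageDigest md = newSha256();
      // Files of a single directory are still hashed sequentially in this sketch.
      for (Map.Entry<String, byte[]> file : dir.files.entrySet()) {
        addEntry(md, 'F', file.getKey(), newSha256().digest(file.getValue()));
      }
      // TreeMap iterates keys and values in the same sorted order.
      int i = 0;
      for (String name : dir.dirs.keySet()) {
        addEntry(md, 'D', name, subtasks.get(i++).join());
      }
      return md.digest();
    }
  }

  /** Digests the whole tree on a dedicated pool with the given parallelism. */
  static byte[] digest(Dir root, int parallelism) {
    ForkJoinPool pool = new ForkJoinPool(parallelism);
    try {
      return pool.invoke(new DigestTask(root));
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    Dir root = new Dir();
    root.files.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
    Dir sub = new Dir();
    sub.files.put("b.txt", "world".getBytes(StandardCharsets.UTF_8));
    root.dirs.put("sub", sub);
    // The result must not depend on how many threads computed it.
    System.out.println(Arrays.equals(digest(root, 1), digest(root, 4)));
  }
}
```

Because each directory's digest depends only on its own sorted entries and its children's digests, the result is the same regardless of thread count or scheduling. A real implementation would read file metadata from disk rather than from memory, and would probably also need to split very large flat directories, which this sketch does not do.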
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
(reproduction is internal to Google, cannot copy-paste it here)
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
unknown
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response