Bazel deadlocks hosts with large numbers of cores #11868

@stellaraccident

Description

Description of the problem / feature request:

Bazel appears to be extremely sensitive to the number of cores available on the local machine: at high core counts (>=64), Bazel will predictably deadlock, often bringing the machine down with it, if there is any I/O latency or contention. We have experienced this consistently on multiple Linux hosts, both physical and virtual, although it happens far more frequently on virtual machines, even when they are provisioned to be fairly isolated and running on SSDs. On a GCE n1-highcpu-64 (96 cores, 57GB RAM, 100GB SSD), we can trigger this deadlock on roughly 80% of builds of our project (which includes parts of TensorFlow and LLVM as deps and has on the order of ~7000 actions). In that configuration, the deadlock usually occurs after ~2000-3000 actions and always follows the same pattern: Bazel reports "128 jobs running", but watching top shows very low utilization (say, 16-30 processes), high CPU usage in the Bazel Java process (200-400%), and a tendency for jobs to "get lost", eventually leaving no active jobs running (from the perspective of top).

On internal developer specialist workstations (with normal disks, not SSD), the occurrence is 100% and happens much sooner in a build.

I have found two workarounds that help the problem:

  1. Set the --output_base to a directory on tmpfs (under /dev/shm), making sure that it is sized appropriately.
  2. Set --spawn_strategy=standalone

For the first workaround, builds become relatively reliable: watching top, Bazel gets close to saturating all cores (90-95% consistent utilization). The second workaround helps, but deadlocks still occur, at a lower rate; when no deadlock is triggered, utilization seems relatively high. I have a low sample count for the second option.
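For reference, the two workarounds above can be sketched as follows (the tmpfs size and the `/dev/shm/bazel-out` path are illustrative assumptions, not values from our setup; size tmpfs to hold the entire output base for your build):

```shell
# Workaround 1 (sketch): put the output base on tmpfs.
# Resize /dev/shm first if needed -- 48G here is an assumption.
sudo mount -o remount,size=48G /dev/shm
# --output_base is a startup option, so it goes before the command.
bazel --output_base=/dev/shm/bazel-out build //...

# Workaround 2 (sketch): run spawns without the sandbox.
bazel build --spawn_strategy=standalone //...
```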

Note also that we maintain roughly parallel CMake/Ninja builds for a substantial fraction of the code and have never hit any such issues with them; in general, they are much more reliable at utilizing all available cores for cc compilation jobs than Bazel is. This is a fairly apples-to-apples comparison running on the same systems.

I have no real knowledge of Bazel internals, but all of the evidence I have seen suggests that at high core counts, Bazel is extremely sensitive to I/O latency, the presence of which exacerbates some kind of locking issue that can cascade into something much worse, causing machines to become unresponsive with no obvious resource contention. I have occasionally seen such machines recover after an hour or so if an external agent kills processes.

Feature requests: what underlying problem are you trying to solve with this feature?

Bazel should operate reliably regardless of the machine size.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Comment out this line and run our build pipeline. Based on internal, ad-hoc testing, I suspect that this can be easily triggered on affected machines by building TensorFlow, or another such project with ~thousands of actions.

Alternatively, building our project on such a machine repros easily: https://google.github.io/iree/get-started/getting-started-linux-bazel

What operating system are you running Bazel on?

Various. Most commonly Debian 10.

What's the output of bazel info release?

Various -- we've experienced this across many versions over a period of months.

Here is one:
release 3.3.1

If bazel info release returns "development version" or "(@Non-Git)", tell us how you built Bazel.

N/A -- although I have also experienced this with custom-built Bazel versions on exotic configs.

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

https://github.com/google/iree.git
c96bbb1d38d3fe81230e38ce3214d80b922ba4c3
c96bbb1d38d3fe81230e38ce3214d80b922ba4c3

Have you found anything relevant by searching the web?

No.

Any other information, logs, or outputs that you want to share?

I can follow up with any artifacts you think might be valuable. I have not found anything worthwhile so far, and when it gets into a really bad state, I'm often on a remote ssh connection and the machine locks up to the point that it is hard to do much.
