Graphs like this are not currently scheduled well:
```
. . . . . . . . . . . . . . . .
|\|\|\|\|/|/|/| |\|\|\|\|/|/|/|
| | | | a | | | | | | | b | | |
* * * * * * * * * * * * * * * *
```
The `.` tasks should definitely take into account the location of the `*` data when scheduling. But if we have 5 workers, every worker will have `*` data on it, while only 2 workers will have an `a` or `b`. In scheduling the first few `.`s, there's a tug-of-war between `a` and the `*`: which do we want to schedule near? We want a way to disregard the `a`.
Say `(*, 0)` completes first, and `a` is already complete on a different worker. Each `*` is the same size as (or smaller than) `a`. We now schedule `(., 0)`. If we choose to go to `a`, we might have a short-term gain, but we've taken a spot that could have gone to better use in the near future. Say the worker holding `a` is already running `(*, 6)`. Now `(., 6)` may get scheduled on yet another worker, because `(., 0)` is already running where it should have gone, and the scheduler prioritizes "where can I start this task soonest" over "how can I minimize data transfer".
This can cascade through all the .s, until we've transferred most root tasks to different workers (on top of a, which we have to transfer everywhere no matter what).
What could have been a nearly-zero-transfer operation is instead likely to transfer every piece of input data to a different worker, greatly increasing memory usage.
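As a toy model of that tug-of-war (the function, worker names, and sizes below are all hypothetical, not distributed's internals), consider a scheduler that greedily minimizes bytes transferred for a single task:

```python
# Toy model of "minimize data transfer" scheduling; not distributed's code.
def naive_best_worker(deps, sizes, who_has, workers):
    # Bytes that worker `w` would have to fetch to run the task.
    def transfer_cost(w):
        return sum(sizes[d] for d in deps if w not in who_has[d])
    return min(workers, key=transfer_cost)

workers = ["w0", "w1", "w2", "w3", "w4"]
# Each worker holds one root chunk; w1 also holds the broadcast `a`.
who_has = {f"star-{i}": {f"w{i}"} for i in range(5)}
who_has["a"] = {"w1"}
sizes = {f"star-{i}": 100 for i in range(5)}
sizes["a"] = 150  # `a` slightly larger than a single chunk

# Scheduling (., 0), whose inputs are star-0 (on w0) and a (on w1):
choice = naive_best_worker({"star-0", "a"}, sizes, who_has, workers)
# The greedy rule picks w1 (a's worker), pulling star-0 off w0, even
# though `a` must be broadcast everywhere regardless.
```

Once `(., 0)` lands on `w1`, the same greedy rule repeats for every subsequent `.`, which is the cascade described above.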
This pattern will occur anytime you broadcast one thing against another in a binary operation (which can occur in arrays, dataframes, bags, etc.).
```python
import dask.array as da

a = da.random.random(100, chunks=10)
x = da.random.random(1)
r = a[1:] * x  # `[1:]` slicing prevents blockwise fusion
r.visualize(optimize_graph=True, collapse_outputs=True)
```

In the above case, the `mul` tasks will tend to "dogpile" onto the one worker that holds the middle `random_sample` task (`x`).
@crusaderky has also observed cases where this "dogpile" effect can cause what should be an embarrassingly-parallel operation to be scheduled entirely on one worker, overwhelming it.
#5325 was a heuristic attempt to fix this, but there are probably better ways to approach it.
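One possible direction (a hypothetical sketch, not the approach taken in #5325 and not an actual scheduler API): when ranking candidate workers for a task, drop dependencies that many sibling tasks also need, since that data will have to be replicated regardless and carries no useful locality signal.

```python
# Hypothetical heuristic sketch; names are illustrative, not distributed's API.
def effective_deps(task_deps, dependents_count, threshold=2):
    """Return only the dependencies worth scheduling near.

    A dependency needed by more than `threshold` tasks (like the
    broadcast `a`) will be copied to many workers anyway, so it is
    dropped from the locality calculation.
    """
    return {d for d in task_deps if dependents_count[d] <= threshold}

# The `.` task at position 0 depends on its own chunk and the broadcast `a`;
# only the chunk remains, so locality points at the chunk's worker.
deps = effective_deps({"star-0", "a"}, {"star-0": 1, "a": 8})
```

A threshold-based cutoff is crude; something weighted by data size or by the fraction of dependents sharing the key might behave better, but the core idea is the same: give the scheduler a way to disregard the `a`.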
