[Dask.order] Ignore data tasks when ordering by fjetter · Pull Request #10619 · dask/dask

fjetter · 2023-11-08T12:36:21Z

This may be a little controversial. However, graph topologies (particularly in the array space) frequently have a dummy task at the bottom of the graph that holds some metadata (e.g. for Zarr). In the xarray world, those are frequently embedded numpy arrays.

I believe we should special-case such tasks, since they can throw off otherwise fine heuristics.

- They never have dependencies, so we can schedule them whenever we want.
- There is no point in running their dependents sooner than other tasks in an attempt to release them. Their data cannot be released because it is embedded in the graph/run_spec (even a released task currently holds on to its run_spec).
- The data itself is typically very small; otherwise it would not be feasible to embed it into a graph.
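To make the notion of a "data task" concrete, here is a minimal sketch on a plain dict graph. It assumes the classic dask graph convention in which a computation is a tuple whose first element is callable and any other value is literal data. The names `make_chunk`, `is_computation`, and `find_data_tasks` and the toy graph are hypothetical and are not taken from this PR; the toy graph also contains no key aliases, which a real implementation would have to handle.

```python
from operator import add


def make_chunk(meta, index):
    """Hypothetical loader: builds one chunk from the embedded metadata."""
    return [float(index)] * meta["chunk_size"]


def is_computation(value):
    """Classic convention: a task is a tuple whose first element is callable."""
    return isinstance(value, tuple) and bool(value) and callable(value[0])


def find_data_tasks(graph):
    """Return the keys whose value is literal data embedded in the graph.

    Illustration only; this is not how dask.order detects such tasks.
    """
    return {key for key, value in graph.items() if not is_computation(value)}


graph = {
    "meta": {"dtype": "f8", "chunk_size": 2},  # embedded metadata, no dependencies
    "chunk-0": (make_chunk, "meta", 0),
    "chunk-1": (make_chunk, "meta", 1),
    "total": (add, "chunk-0", "chunk-1"),
}

data_keys = find_data_tasks(graph)
print(sorted(data_keys))
```

In this toy graph only `meta` is classified as a data task: it has no dependencies, and its value lives in the graph itself, so nothing is gained by trying to release it early.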

So, why is this controversial?

- With this, ordering would differ between, say, da.from_array(np.zeros(100), chunks=20) and da.zeros(100, chunks=20), since the first literally embeds the numpy array into the dask graph while the latter generates the data whenever needed. I'm not sure this is such a bad thing. It may be a little surprising, but I don't think it will have negative effects.
- Those dummy / data tasks would now run immediately. This could cause tasks to pile up on a few workers in a scale-up cluster scenario, although this is a problem that should be fixed elsewhere IMO. This downside is currently purely an implementation detail: for simplicity, I just scheduled those tasks first. This could be changed, but it would require some non-trivial code in the ordering, which is why I would only want to do it if necessary.
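The "schedule those first" simplification can be sketched with a toy ordering function. This is only an illustration under stated assumptions: the real dask.order heuristics are far more elaborate, and the function `order_data_first`, the explicit `dependencies` mapping, and the task names are all hypothetical.

```python
def order_data_first(dependencies, data_keys):
    """Assign priorities: data tasks first, then a plain depth-first order.

    `dependencies` maps every key to the set of keys it depends on.
    Lower numbers run earlier. Assumes the graph is acyclic.
    """
    priorities = {}

    # Data tasks have no dependencies, so they can take the first slots.
    for key in sorted(data_keys):
        priorities[key] = len(priorities)

    def visit(key):
        if key in priorities:
            return
        for dep in sorted(dependencies[key]):
            visit(dep)
        priorities[key] = len(priorities)

    for key in sorted(dependencies):
        visit(key)
    return priorities


dependencies = {
    "meta-a": set(),
    "meta-b": set(),
    "a-0": {"meta-a"},
    "b-0": {"meta-b"},
    "total": {"a-0", "b-0"},
}

with_special_case = order_data_first(dependencies, {"meta-a", "meta-b"})
without_special_case = order_data_first(dependencies, set())
print(with_special_case)
print(without_special_case)
```

With the special case, both `meta-*` tasks receive the lowest priorities and the remaining tasks are ordered as if the data tasks were not there. Without it, `meta-b` is only ordered once the traversal reaches `b-0`.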

Closes #10618

xref #10535

dcherian · 2023-11-10T19:12:13Z

> In the xarray world, those are frequently embedded numpy arrays.

And wrappers around large on-disk arrays like netCDF/HDF/Zarr! These wrappers are small in memory but represent a large amount of data on disk.

fjetter · 2023-12-14T17:00:18Z

Opened #10706 instead, since it's a different implementation.

fjetter added 2 commits November 8, 2023 13:18

Ignore data tasks when ordering

3d43d21

remove visualize

2d05b6f

fjetter mentioned this pull request Nov 30, 2023

Dask.order rewrite using a critical path approach #10660

Merged

This was referenced Dec 14, 2023

[Dask.order] Remove non-runnable leaf nodes from ordering #10697

Merged

[Dask.order] Ignore data tasks when ordering #10706

Merged

fjetter closed this Dec 14, 2023

fjetter wants to merge 2 commits into dask:main from fjetter:dask_order_ignore_data_task