Skip to content

[Dask.order] Ignore data tasks when ordering#10619

Closed
fjetter wants to merge 2 commits intodask:mainfrom
fjetter:dask_order_ignore_data_task
Closed

[Dask.order] Ignore data tasks when ordering#10619
fjetter wants to merge 2 commits intodask:mainfrom
fjetter:dask_order_ignore_data_task

Conversation

@fjetter
Copy link
Copy Markdown
Member

@fjetter fjetter commented Nov 8, 2023

This may be a little controversial... However, there are frequently topologies (particularly in the array space) that have a dummy task at the bottom of the graph that includes some metadata (e.g. for zarr). In the xarray world, those are frequently embedded numpy arrays.

I believe we should special case such tasks since they can throw off otherwise fine heuristics.

  1. They never have dependencies so we can schedule them whenever we want
  2. There is no point in running their dependents more quickly than others trying to release them. We cannot release their data since the data is embedded in the graph/run_spec (even a released task currently holds on to their run_spec)
  3. The data itself is typically very small, otherwise it would not be feasible to embed it into a graph

So, why is this controversial

  1. With this, ordering would be different for say da.from_numpy(np.zeros(100), chunks=20) and da.zeros(100, chunk=20) since the first would literally embed the numpy array into the dask graph while the latter generates the data whenever needed. I'm not sure if this is such a bad thing. It may just be a little surprising but I don't think this will have negative effects.
  2. Those dummy / data tasks would now run immediately. This could cause tasks to pile up on few workers in a scale-up cluster scenario although this is a problem that should be fixed elsewhere IMO. This downside is currently purely an implementation details since for simplicity I just scheduled those first. This could be changed but would require some non-trivial code in the ordering which is why I would only want to do this if necessary.

Closes #10618

xref #10535

@dcherian
Copy link
Copy Markdown
Collaborator

dcherian commented Nov 10, 2023

In the xarray world, those are frequently embedded numpy arrays.

And wrappers around large on-disk arrays like netcdf/hdf/Zarr! These wrappers are small in memory but represent a large amount of data on disk

@fjetter
Copy link
Copy Markdown
Member Author

fjetter commented Dec 14, 2023

Opened #10706 instead since it's a different implementation

@fjetter fjetter closed this Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Dask.order] Memory usage regression for flox xarray reductions

2 participants