-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Perform all graph optimizations for all collection types #11854
Description
Discussion originally started here: https://dask.discourse.group/t/optimal-graph-optimization-when-mixing-dask-objects/3867
Short summary is that there are cases where a single graph consists of tasks from different types and they may not be optimally optimized. For example, a series of Array operations ending in a Delayed task will be optimized as a Delayed task and miss all the possible Array graph optimizations. This gets even more complicated when da.store is involved.
I don't have the larger context for the discussion in issues like #11458 to understand it and the help I got on discourse lead to filing this issue for more maintainers eyes. The main question is, can all collection optimizations always be performed during optimizations? In my real world use case, in a simple data case, I see a 35+% improvement in execution time by forcing Array optimizations on a graph that ends in a Delayed object.
Here is a basic example showing the issue:
import dask
import dask.array as da
@dask.delayed(pure=True)
def delayed_func(arr1):
return arr1 + 1
start = da.zeros((2, 2), chunks=1)
subarr1 = start[:1]
subarr2 = subarr1[:, :1]
delay_result = delayed_func(subarr2)
assert len(delay_result.dask.keys()) == 9 # zeros * 4 -> getitem * 3 -> finalize -> delayed_func
if True:
# current dask
assert len(dask.optimize(delay_result)[0].dask.keys()) == 5 # zeros * 1 -> getitem * 2 -> finalize -> delayed_func
else:
# if Arrays were detected in Delayed graphs:
assert len(dask.optimize(delay_result)[0].dask.keys()) == 2 # finalize-getitem-zeros -> delayed_func
# use array optimize instead of delayed
with dask.config.set(delayed_optimize=da.optimize):
assert len(dask.optimize(delay_result)[0].dask.keys()) == 2 # finalize-getitem-zeros -> delayed_funcRelated: