-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Closed
Description
Recently we started culling dask.array graphs when slicing if the output array had many fewer output chunks than the input array. There is a tradeoff here between copying the full .dask attribute and culling it. Copying is cheaper per node but operates on all of the nodes. Culling is much more expensive per node but operates only on the nodes that survive the process. Additionally culling is also better downstream because future computations will have less to deal with.
However, there are some fail cases. Here is a good one:
from dask import array as da
z = da.ones((100, 100), chunks=(100, 1))
z = z.rechunk((1, 100))
z = z.rechunk((100, 1))
z = z.rechunk((1, 100))
z = z.rechunk((100, 1))
z = z.rechunk((1, 100))
z = z.rechunk((100, 1))
z = z.rechunk((1, 100))
%prun z[0] 818565 function calls (576944 primitive calls) in 0.680 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
272107/60802 0.281 0.000 0.489 0.000 core.py:159(_deps)
30903/602 0.173 0.000 0.284 0.000 core.py:186(<listcomp>)
1 0.113 0.113 0.651 0.651 optimize.py:14(cull)
60801 0.030 0.000 0.519 0.000 core.py:197(get_dependencies)
1 0.027 0.027 0.680 0.680 <string>:1(<module>)
30905 0.022 0.000 0.022 0.000 {built-in method builtins.sum}
241204 0.013 0.000 0.013 0.000 {built-in method builtins.callable}
60801 0.008 0.000 0.008 0.000 {method 'pop' of 'list' objects}
60810 0.008 0.000 0.008 0.000 {method 'add' of 'set' objects}
60800 0.003 0.000 0.003 0.000 {method 'append' of 'list' objects}
1 0.001 0.001 0.653 0.653 core.py:1104(__getitem__)
5/1 0.000 0.000 0.000 0.000 abc.py:194(__subclasscheck__)
1 0.000 0.000 0.000 0.000 slicing.py:240(slice_slices_and_integers)
This is run under @pitrou 's branch #1731
Some options:
- We create a better fast check that allows us to approximate the size of the output graph that doesn't rely on
get_dependencies(slow) and use that to decide if we should cull or copy - We provide globals to decide to cull or not (though this feels like bloat to me)
- ??
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels