ENH: Delayed variant of persist (pin?) #2156
Description
It seems that storing an intermediate result currently requires a call to persist or cache, both of which trigger computation in the background. Something that would be really nice is delayed caching (pinning?).
To elaborate a bit, pinning would mean that when computation is triggered by other means, the pinned object would be computed once and then persisted on the workers. However, marking the object as pinned would not trigger any computation on its own. Here's a simple example to demonstrate what this might mean.
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: a = da.from_array(np.arange(6), chunks=(6,))
In [4]: b = a * 2
In [5]: b = b.pin() # persist this later
In [6]: c = b + 3
In [7]: c.compute() # compute `c` and persist `b` in the process
Out[7]: array([ 3, 5, 7, 9, 11, 13])
In [8]: b.compute() # fetch `b`, it was already computed when `c` was
Out[8]: array([ 0, 2, 4, 6, 8, 10])
In [9]: del a, b, c # free memory; both `b` and `c` must be released to free `b`'s result
Not to walk through all of this, but calling b.pin() notes that the value should be persisted on the workers once computation is triggered; otherwise it is just an annotation in the Dask graph. When we call c.compute(), computation is triggered, which computes b as well. Once b is computed, its result is kept on the workers, just as b.persist() does. So when we later call b.compute(), it doesn't really compute anything; it just returns the result that was persisted in memory on the workers. To free the memory attached to b, we release all references to it, just as we would with persist.
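To make the intended semantics concrete, here is a minimal, purely illustrative sketch of the compute-once-then-cache behavior described above. It does not use Dask at all; the `Lazy` class and its `pin`/`compute` methods are invented names for this example, standing in for what a real `pin()` on a Dask collection might do.

```python
# Hypothetical sketch of "pin" semantics: pin() is only an annotation,
# and the pinned value is cached on the first compute triggered downstream.

class Lazy:
    """A lazy computation over other Lazy dependencies."""

    def __init__(self, func, *deps):
        self._func = func
        self._deps = deps
        self._pinned = False
        self._cache = None
        self._has_cache = False

    def pin(self):
        # Only an annotation: no computation happens here.
        self._pinned = True
        return self

    def compute(self):
        if self._has_cache:  # already "persisted": just return it
            return self._cache
        args = [d.compute() for d in self._deps]
        result = self._func(*args)
        if self._pinned:  # persist the result on first computation
            self._cache = result
            self._has_cache = True
        return result


calls = []  # track how many times b's body actually runs

a = Lazy(lambda: list(range(6)))
b = Lazy(lambda xs: (calls.append("b"), [x * 2 for x in xs])[1], a).pin()
c = Lazy(lambda xs: [x + 3 for x in xs], b)

print(c.compute())  # computing c also computes and caches b
print(b.compute())  # served from the cache; b's body is not re-run
print(calls)        # b's function ran exactly once
```

Releasing the references (`del a, b, c`) would then drop the cached result, mirroring how memory is freed for a persisted Dask collection.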