
ENH: Delayed variant of persist (pin?)  #2156

@jakirkham


It seems that storing an intermediate result currently requires calling persist or cache, both of which trigger a computation in the background. Something that would be really nice is delayed caching (pinning?).

To elaborate a bit, pinning would mean that when computation is triggered by other means, the pinned object would be computed once and then persisted on the workers. However, marking the object as pinned would not result in any computation on its own. Here's a simple example to demonstrate what this might mean.

In [1]: import dask.array as da

In [2]: import numpy as np

In [3]: a = da.from_array(np.arange(6), chunks=(6,))

In [4]: b = a * 2

In [5]: b = b.pin()  # persist this later

In [6]: c = b + 3

In [7]: c.compute()  # compute `c` and persist `b` in the process
Out[7]: array([ 3,  5,  7,  9, 11, 13])

In [8]: b.compute()  # fetch `b`, it was already computed when `c` was
Out[8]: array([ 0,  2,  4,  6,  8, 10])

In [9]: del a, b, c  # free memory; both `b` and `c` must be released to free `b`'s memory

To walk through this briefly: calling b.pin() notes that this value should be persisted on the workers once computation is triggered, but otherwise it is just an annotation in the Dask graph. When we call c.compute(), this triggers computation, which results in computing b as well. Once b is computed, its result is kept on the workers, just as it would be with b.persist(). So when we call b.compute(), it doesn't really compute anything. It just returns the result that was persisted in memory on the workers. To free the memory attached to b, we just release all references to it, as we would do with persist.
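To make the intended semantics concrete without relying on any Dask internals, here is a minimal pure-Python sketch. The `Lazy` class, its `pin` method, and the `evaluations` counter are hypothetical names made up for illustration; they are not Dask APIs, and plain lists stand in for arrays and worker memory.

```python
# Toy model of the proposed behaviour. `Lazy`, `pin`, and `evaluations`
# are hypothetical names for illustration only; none of this is Dask API.

class Lazy:
    """A deferred value built from a function and its lazy inputs."""

    def __init__(self, func, *deps):
        self.func = func
        self.deps = deps
        self.pinned = False     # annotation only; set by pin()
        self._cached = False    # True once a pinned result has been stored
        self._value = None
        self.evaluations = 0    # how many times func actually ran

    def pin(self):
        """Mark this node to be kept once computed; does no work itself."""
        self.pinned = True
        return self

    def compute(self):
        # A pinned node that was already computed returns its stored result.
        if self._cached:
            return self._value
        args = [dep.compute() for dep in self.deps]
        result = self.func(*args)
        self.evaluations += 1
        # Only pinned nodes keep their result after computation.
        if self.pinned:
            self._value, self._cached = result, True
        return result


a = Lazy(lambda: list(range(6)))
b = Lazy(lambda xs: [x * 2 for x in xs], a).pin()   # persist this later
c = Lazy(lambda xs: [x + 3 for x in xs], b)

assert b.evaluations == 0    # pinning alone triggered no computation
c_result = c.compute()       # computes `b` on the way and keeps it
b_result = b.compute()       # served from the stored result, no recompute
```

In this toy model the stored result lives on the `Lazy` object itself, so dropping every reference to `b` (including the one held by `c`) is what frees it, which mirrors the `del a, b, c` step above.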
