Skip to content

Ordering changes negatively affecting memory use #5859

@JSKenyon

Description

@JSKenyon

I will attempt to follow this report up with an example. I encountered this behaviour in a relatively complicated example and replicating it in a simpler case is eluding me. I am using dask (2.9.1/2.10.1, see remainder of post), distributed 2.10.0 and Python 3.6.9.

My specific problem involves reading large amounts of data from disk, processing it in independent chunks and then writing the results to disk again. Prior to dask 2.9.2 the memory usage of my code behaved as expected, exhibiting a typical heartbeat pattern as data was read - processed - written.

Post dask 2.9.2, the above is no longer the case and the maximum memory footprint has increased, often substantially. I believe that this is a result of changes to dask.order.order made here.

As an example of what I believe is causing this behaviour, consider the following graph:

   E          # Write operation.
   |   
   C D        # C: Additional computation. D: Write operation. 
   |/
   B          # Computation producing two outputs.
   |
   A          # Read operation.

dask.compute is then called on [D, E]. In my head (and in the past), this has worked to write two data products to disk per input chunk without growing extensively in memory. However, as of 2.9.2, it seems that the scheduler is no longer giving priority to node D. The result is that all those write operations are being kept in memory until every node E is completed. This does not seem like desirable behaviour.

There is one additional peculiarity - this memory growth is only evident when using the distributed scheduler. I cannot think of any explanation for this as its version was unchanged across my tests.

To boil it down - has anyone else experienced similar changes in behaviour? Is it plausible that changes to ordering could have this effect? Finally, does anyone have any advice for producing an MRE?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions