Large memory increase and processing slowness during graph creation #7851

@chrisroat

What happened:

When creating a graph consisting of delayed dataframes built from 20k image blocks, memory climbs to over 10 GB and graph construction alone takes several minutes. No execution of the graph is done. I experimented with fewer blocks, and the scaling is quite non-linear. Processing time and memory used (measured via the GKE JupyterHub pod's memory usage) by number of blocks:

blocks => time, memory
 1000  =>   3s
 2000  =>   9s
 4000  =>  28s, 0.6 GB
 8000  =>  93s, 2.4 GB
16000  => 324s, 9.1 GB

What you expected to happen:

I expected 20k blocks to fit under a minute and under a gigabyte. I also expected performance to scale linearly with the number of blocks, since the blocks are independent (as can be verified by calling .visualize(...) on a single dataframe; see the sketch below).
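
For reference, a minimal sketch of that independence check (the single-block size and output filename are illustrative, and rendering requires graphviz to be installed):

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np

# Build one delayed dataframe from a single block, as in the example below.
image = da.zeros(1, dtype=np.uint16, chunks=1)
chunk = image.to_delayed().flatten()[0]
ddf = dd.from_delayed(dask.delayed(lambda x: None)(chunk), meta=[("z", np.float32)])

# The rendered graph contains only this block's tasks, with no cross-block edges.
ddf.visualize("single_block_graph.png")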

Minimal Complete Verifiable Example:

A scaled-down example with 8k blocks, which is enough to demonstrate the memory growth. My actual dataset has 20k blocks, roughly 3 TB divided into 150 MB blocks, but the problem appears to be purely a function of block count, not data size.

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np

# 8000 single-element chunks; only the block count matters, not the data size.
image = da.zeros(8000, dtype=np.uint16, chunks=1)
block_iter = zip(np.ndindex(*image.numblocks), image.to_delayed().flatten())

# One delayed dataframe per block; the graph is never computed.
ddf_all = np.empty(image.numblocks, dtype=object)
for idx_chunk, chunk in block_iter:
    ddf_delayed = dask.delayed(lambda x: None)(chunk)
    ddf_all[idx_chunk] = dd.from_delayed(ddf_delayed, meta=[("z", np.float32)])
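
To reproduce the timing column of the table above, one might wrap the construction in a function and time it for increasing block counts (a sketch, not part of the original report; the memory figures were read from the pod's memory usage rather than measured in-process):

import time

import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np

def build_graph(n_blocks):
    # Same construction as above, parameterized by the number of blocks.
    image = da.zeros(n_blocks, dtype=np.uint16, chunks=1)
    block_iter = zip(np.ndindex(*image.numblocks), image.to_delayed().flatten())
    ddf_all = np.empty(image.numblocks, dtype=object)
    for idx_chunk, chunk in block_iter:
        ddf_delayed = dask.delayed(lambda x: None)(chunk)
        ddf_all[idx_chunk] = dd.from_delayed(ddf_delayed, meta=[("z", np.float32)])
    return ddf_all

for n in (1000, 2000, 4000, 8000):
    start = time.perf_counter()
    build_graph(n)
    print(f"{n:>5} blocks: {time.perf_counter() - start:.0f}s")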

Environment:

  • Dask version: 2021.06.2
  • Python version: 3.8.10
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): conda
