Skip to content

performance indexing stacked array #5913

@d-v-b

Description

@d-v-b

I have a large imaging dataset with ~30k images, each 40 MB. I'm combining these images into a large dask array by defining a delayed function for loading each individual image, then using da.from_delayed and a da.stack. The resulting array has dimensions (30265, 2, 1000, 10000) and chunksize (1, 2, 1000, 10000), for a total of ~120k tasks (4 tasks per image file).

I noticed that indexing along the first axis of this array (i.e., along the chunked axis) takes 30 ms; i.e., calling [x for x in array] takes ~900s, which seems a bit on the long side, given that these tasks have 0 horizontal dependencies, and everything is lazy. By contrast, the same indexing pattern used on a dummy dataset (da.zeros combined with stack) with the same shape + dtype + chunking takes ~.700 ms per "image" with a total of ~9s for [x for x in dummy_data]. That's not lighting-fast but it's acceptable, unlike 900s.

Short-term it's clear that I should simply avoid the pattern of big_array = da.stack(sub_arrays)...[b for b in big_array] when big_array is large... But I wonder a) why this pattern is so slow and b) if there's anything I can do (or dask can do) to speed up this access pattern.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions