Description
I have a large imaging dataset with ~30k images, each 40 MB. I'm combining these images into a large dask array by defining a delayed function for loading each individual image, then using da.from_delayed and a da.stack. The resulting array has dimensions (30265, 2, 1000, 10000) and chunksize (1, 2, 1000, 10000), for a total of ~120k tasks (4 tasks per image file).
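A minimal sketch of this construction pattern, with a placeholder `load_image` function and dummy file names standing in for the real image reader and the ~30k files:

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_image(path):
    # Placeholder for the real image reader; each file yields a
    # (2, 1000, 10000) array (~40 MB at uint16).
    return np.zeros((2, 1000, 10000), dtype=np.uint16)

# Small stand-in for the ~30k real file paths.
paths = [f"img_{i:05d}.tif" for i in range(8)]

# One delayed load per file, wrapped as a single-chunk dask array.
sub_arrays = [
    da.from_delayed(load_image(p), shape=(2, 1000, 10000), dtype=np.uint16)
    for p in paths
]

# Stack along a new leading axis: chunksize (1, 2, 1000, 10000).
stacked = da.stack(sub_arrays)
print(stacked.shape, stacked.chunksize)
```

Nothing is loaded here; the cost under discussion is purely graph construction and indexing on the resulting array.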
I noticed that indexing along the first axis of this array (i.e., along the chunked axis) takes ~30 ms per element; i.e., calling `[x for x in array]` takes ~900 s, which seems a bit on the long side, given that these tasks have no horizontal dependencies and everything is lazy. By contrast, the same indexing pattern used on a dummy dataset (`da.zeros` combined with `da.stack`) with the same shape + dtype + chunking takes ~0.7 ms per "image", with a total of ~9 s for `[x for x in dummy_data]`. That's not lightning-fast, but it's acceptable, unlike 900 s.
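For reference, the dummy comparison can be reproduced along these lines (using a handful of sub-arrays rather than the full 30265, so the sizes here are illustrative only):

```python
import numpy as np
import dask.array as da

# Dummy data with the same shape, dtype, and chunking as the real
# dataset, but built from da.zeros instead of delayed image loads.
sub_arrays = [
    da.zeros((2, 1000, 10000), dtype=np.uint16, chunks=(2, 1000, 10000))
    for _ in range(16)  # stand-in for 30265 sub-arrays
]
dummy_data = da.stack(sub_arrays)

# Iterating along the stacked axis yields one lazy slice per "image";
# no data is computed, yet each slice still pays graph/indexing overhead.
slices = [x for x in dummy_data]
print(len(slices), slices[0].shape)
```

Wrapping the list comprehension in a timer (e.g. `time.perf_counter`) on both the real and dummy arrays is what produces the ~900 s vs ~9 s comparison above.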
Short-term, it's clear that I should simply avoid the pattern `big_array = da.stack(sub_arrays)` followed by `[b for b in big_array]` when `big_array` is large. But I wonder (a) why this pattern is so slow, and (b) whether there's anything I can do (or dask can do) to speed up this access pattern.