Description
I have a large imaging dataset with ~30k images, each 40 MB. I'm combining these images into a large dask array by defining a delayed function for loading each individual image, then using da.from_delayed and a da.stack. The resulting array has dimensions (30265, 2, 1000, 10000) and chunksize (1, 2, 1000, 10000), for a total of ~120k tasks (4 tasks per image file).
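A minimal sketch of this construction pattern, with a placeholder `load_image` function and dummy file names standing in for the real image reader and the ~30k files:

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_image(path):
    # Placeholder for the real image reader; each file yields a
    # (2, 1000, 10000) array (~40 MB at uint16).
    return np.zeros((2, 1000, 10000), dtype=np.uint16)

# Small stand-in for the ~30k real file paths.
paths = [f"img_{i:05d}.tif" for i in range(8)]

# One delayed load per file, wrapped as a single-chunk dask array.
sub_arrays = [
    da.from_delayed(load_image(p), shape=(2, 1000, 10000), dtype=np.uint16)
    for p in paths
]

# Stack along a new leading axis: chunksize (1, 2, 1000, 10000).
stacked = da.stack(sub_arrays)
print(stacked.shape, stacked.chunksize)
```

Nothing is loaded here; the cost under discussion is purely graph construction and indexing on the resulting array.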
I noticed that indexing along the first axis of this array (i.e., along the chunked axis) takes ~30 ms per element; i.e., calling `[x for x in array]` takes ~900 s, which seems a bit on the long side, given that these tasks have no horizontal dependencies and everything is lazy. By contrast, the same indexing pattern used on a dummy dataset (`da.zeros` combined with `da.stack`) with the same shape + dtype + chunking takes ~0.7 ms per "image", with a total of ~9 s for `[x for x in dummy_data]`. That's not lightning-fast, but it's acceptable, unlike 900 s.
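For reference, the dummy comparison can be reproduced along these lines (using a handful of sub-arrays rather than the full 30265, so the sizes here are illustrative only):

```python
import numpy as np
import dask.array as da

# Dummy data with the same shape, dtype, and chunking as the real
# dataset, but built from da.zeros instead of delayed image loads.
sub_arrays = [
    da.zeros((2, 1000, 10000), dtype=np.uint16, chunks=(2, 1000, 10000))
    for _ in range(16)  # stand-in for 30265 sub-arrays
]
dummy_data = da.stack(sub_arrays)

# Iterating along the stacked axis yields one lazy slice per "image";
# no data is computed, yet each slice still pays graph/indexing overhead.
slices = [x for x in dummy_data]
print(len(slices), slices[0].shape)
```

Wrapping the list comprehension in a timer (e.g. `time.perf_counter`) on both the real and dummy arrays is what produces the ~900 s vs ~9 s comparison above.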
Short-term, it's clear that I should simply avoid the pattern `big_array = da.stack(sub_arrays)` followed by `[b for b in big_array]` when `big_array` is large. But I wonder (a) why this pattern is so slow, and (b) whether there's anything I can do (or dask can do) to speed up this access pattern.