Description
I'm opening this to start a discussion - I don't know if it will actually turn out to be a good idea. If there's agreement that it's worth investigating, I will try to find some time to implement it.
Proposal
At present a dask array stores the chunking scheme as a list of chunk sizes (per dimension), but there are a number of places where the positions of chunk boundaries are needed (just search for np.cumsum to see some examples). Potentially these cumulative sums could be computed when an Array is constructed and stored in the array. Obviously it would not be usable for arrays with unknown chunk sizes.
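As a rough sketch of what I have in mind (the helper name below is hypothetical, not existing dask API), the boundary positions could be computed once per dimension from the existing chunks tuple and then reused anywhere boundaries are needed:

```python
import numpy as np

def cumulative_boundaries(chunks):
    """Hypothetical helper: chunk boundary offsets per dimension, computed once.

    `chunks` is the usual tuple-of-tuples of chunk sizes, e.g. ((2, 3, 2), (4, 4)).
    """
    return tuple(np.cumsum((0,) + tuple(c)) for c in chunks)

chunks = ((2, 3, 2), (4, 4))
boundaries = cumulative_boundaries(chunks)
# boundaries[0] -> array([0, 2, 5, 7]); boundaries[1] -> array([0, 4, 8])
```

If something like this were attached to the Array at construction time, the slicing code could look up boundaries with searchsorted instead of recomputing the cumsum on every call.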
Background
We have a home-grown chunked array storage format, and present it to the user as a Python object that supports slicing to load numpy arrays (much like h5py does). The implementation uses dask, but the user doesn't need to know anything about dask, and a somewhat typical access pattern is to iteratively process a file, loading a few rows at a time. That means we end up using dask something like this:
```python
for i in range(0, a.shape[0], step):
    chunk = a[i:i+step, ...].compute()
    # do something with chunk
```

Assuming that step matches the chunk size and there are n chunks along axis 0, constructing each a[i:i+step] takes O(n) time for the cumsum (in _slice_1d), and so the whole loop takes O(n²) time. If the cumulative sums are precomputed, then I think the whole loop should only require O(n log n) time (the log n coming from searchsorted). Some of our datasets have 10K-100K chunks along this axis, so that would be a substantial improvement.
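To make the complexity argument concrete, here is a hedged sketch (not the real _slice_1d, just an illustration of the same cumsum-versus-lookup trade-off; the function names are made up):

```python
import numpy as np

def chunk_index_recompute(chunks_1d, i):
    # Current pattern: recompute the boundaries for every slice -> O(n) per call.
    boundaries = np.cumsum(chunks_1d)
    return int(np.searchsorted(boundaries, i, side="right"))

def chunk_index_cached(boundaries, i):
    # Proposed pattern: boundaries computed once up front -> O(log n) per call.
    return int(np.searchsorted(boundaries, i, side="right")) - 1

chunks_1d = (100,) * 10_000               # e.g. 10K chunks along axis 0
boundaries = np.cumsum((0,) + chunks_1d)  # done once, at Array construction
assert chunk_index_recompute(chunks_1d, 250) == chunk_index_cached(boundaries, 250) == 2
```

With cached boundaries each iteration of the loop above would be O(log n) instead of O(n), which is where the O(n log n) total comes from.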
It's possible that this change won't actually fix the problem if anything else in the process requires O(n) time per iteration. We've had some issues in the past with the graph optimiser doing some work prior to pruning, but I haven't checked whether that's still an issue now that HighLevelGraph has been introduced.