Description
I'm opening this to start a discussion - I don't know if it will actually turn out to be a good idea. If there's agreement that it's worth investigating, I will try to find some time to implement it.
Proposal
At present a dask array stores the chunking scheme as a list of chunk sizes (per dimension), but there are a number of places where the positions of chunk boundaries are needed (just search for np.cumsum to see some examples). Potentially these cumulative sums could be computed when an Array is constructed and stored in the array. Obviously it would not be usable for arrays with unknown chunk sizes.
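As a rough sketch of what I have in mind (the helper name below is hypothetical, not existing dask API), the boundary positions could be computed once per dimension from the existing chunks tuple and then reused anywhere boundaries are needed:

```python
import numpy as np

def cumulative_boundaries(chunks):
    """Hypothetical helper: chunk boundary offsets per dimension, computed once.

    `chunks` is the usual tuple-of-tuples of chunk sizes, e.g. ((2, 3, 2), (4, 4)).
    """
    return tuple(np.cumsum((0,) + tuple(c)) for c in chunks)

chunks = ((2, 3, 2), (4, 4))
boundaries = cumulative_boundaries(chunks)
# boundaries[0] -> array([0, 2, 5, 7]); boundaries[1] -> array([0, 4, 8])
```

If something like this were attached to the Array at construction time, the slicing code could look up boundaries with searchsorted instead of recomputing the cumsum on every call.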
Background
We have a home-grown chunked array storage format, and present it to the user as a Python object that supports slicing to load numpy arrays (much like h5py does). The implementation uses dask, but the user doesn't need to know anything about dask, and a somewhat typical access pattern is to iteratively process a file, loading a few rows at a time. That means we end up using dask something like this:
```python
for i in range(0, a.shape[0], step):
    chunk = a[i:i+step, ...].compute()
    # do something with chunk
```

Assuming that step matches the chunk size and there are n chunks along axis 0, constructing each a[i:i+step] takes O(n) time for the cumsum (in _slice_1d), and so the whole loop takes O(n²) time. If the cumulative sums are precomputed, then I think the whole loop should only require O(n log n) time (the log n coming from searchsorted). Some of our datasets have 10K-100K chunks along this axis, so that would be a substantial improvement.
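To make the complexity argument concrete, here is a hedged sketch (not the real _slice_1d, just an illustration of the same cumsum-versus-lookup trade-off; the function names are made up):

```python
import numpy as np

def chunk_index_recompute(chunks_1d, i):
    # Current pattern: recompute the boundaries for every slice -> O(n) per call.
    boundaries = np.cumsum(chunks_1d)
    return int(np.searchsorted(boundaries, i, side="right"))

def chunk_index_cached(boundaries, i):
    # Proposed pattern: boundaries computed once up front -> O(log n) per call.
    return int(np.searchsorted(boundaries, i, side="right")) - 1

chunks_1d = (100,) * 10_000               # e.g. 10K chunks along axis 0
boundaries = np.cumsum((0,) + chunks_1d)  # done once, at Array construction
assert chunk_index_recompute(chunks_1d, 250) == chunk_index_cached(boundaries, 250) == 2
```

With cached boundaries each iteration of the loop above would be O(log n) instead of O(n), which is where the O(n log n) total comes from.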
It's possible that this change won't actually fix the problem if anything else in the process requires O(n) time per iteration. We've had some issues in the past with the graph optimiser doing some work prior to pruning, but I haven't checked whether that's still an issue now that HighLevelGraph has been introduced.