Merged
One of the performance problems identified in dask#5913 was that `Array.shape` gets called many times, and each time it is computed by summing the chunk sizes. By using `cached_cumsum`, the calculation should be fast after the first time. While a cumsum is overkill for computing the shape, it's likely that the cached result will be used again at some point.

An alternative would be to store the shape in the object itself (e.g. computed in the constructor). That would require all the bits of internal code that modify `_chunks` directly to be changed to update the shape as well.

Also change `cached_cumsum` to return a tuple instead of a numpy array, and use `toolz.accumulate` rather than `np.cumsum`. I originally needed this because `shape` was getting the wrong type (a tuple of numpy scalars instead of Python scalars), but it also makes the example in dask#5913 (comment) about 10% faster. While it's a little slower to compute the first time, the result of `cached_cumsum` is typically used for iteration or indexing in Python land, rather than with vectorised numpy operations, and those are a lot faster on a tuple. Similarly, the `bisect` module is much faster than `np.searchsorted` because the latter is dominated by C API overhead.

The one downside of using a tuple instead of a numpy array is that slicing it is O(n) instead of O(1). That means the `initial_zero=False` answer is now a separate cache entry instead of the `initial_zero=True` answer with the first element sliced off. There are also a few bits of code that don't need the final element and slice it with `[:-1]`, which will now make a copy. But I think it's still an improvement overall.
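A minimal sketch of the idea (not dask's actual implementation): cache the cumulative chunk sizes as a tuple of Python ints, with `initial_zero=True` and `initial_zero=False` as separate cache entries. It uses `itertools.accumulate` here in place of `toolz.accumulate` to stay dependency-free; the function and cache names are illustrative only.

```python
from itertools import accumulate
from operator import add

# Illustrative module-level cache; dask's real cache differs.
_cache = {}

def cached_cumsum(seq, initial_zero=False):
    """Cumulative sum of ``seq`` as a tuple of Python ints.

    Because slicing a tuple copies (O(n)), the initial_zero=False
    result is cached separately rather than derived from the
    initial_zero=True entry by slicing off the leading 0.
    """
    key = (seq, initial_zero)
    if key not in _cache:
        if initial_zero:
            _cache[key] = tuple(accumulate(seq, add, initial=0))
        else:
            _cache[key] = tuple(accumulate(seq, add))
    return _cache[key]
```

Repeated calls with the same chunk sizes then return the identical cached tuple, so `Array.shape` only pays the summation cost once.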
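To illustrate the `bisect` vs `np.searchsorted` point: for a single scalar lookup against a short sequence of chunk boundaries, `bisect` on a plain tuple avoids the per-call numpy C API overhead. The boundary values and helper name below are hypothetical.

```python
import bisect

# Hypothetical cumulative chunk boundaries (with a leading 0), as a tuple.
bounds = (0, 2, 5, 9)

def chunk_of(index):
    """Return the chunk number containing element ``index``."""
    # bisect_right finds the first boundary strictly greater than index;
    # subtracting 1 gives the chunk whose range contains it.
    return bisect.bisect_right(bounds, index) - 1
```

For example, element 4 lies in chunk 1 (the chunk covering indices 2–4), which `chunk_of(4)` returns without touching numpy.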
Member:

Thanks for this @bmerry. I went through the uses of cached_cumsum and I agree that this seems to be used mostly in python-scale cases. In general things here look good to me. I'm going to go ahead and merge.