Conversation
The CI is currently failing because, in my attempt to fix the failing cupy test, I've used NumPy array creation via the NEP-35 `like=` keyword, which is only available in NumPy >= 1.20.
pentschev
left a comment
Thanks @GenevieveBuckley , I can confirm the test passes on my end as well. The fact that we now need NumPy>=1.20 is technically a regression, but probably a small one, I'm fine with telling users they need to upgrade. But could you add the line below to test_bincount to ensure it doesn't fail on older NumPy versions?
```python
@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
```

so the test is skipped for users using older versions of numpy.
Thanks @GenevieveBuckley for finding this! In xhistogram (xgcm/xhistogram#27) we are experiencing problems with the output of `dask.array.bincount`. Specifically:

```python
import dask.array as dsa

ones = dsa.ones(shape=(100,), chunks=(110), dtype='int')
print(dsa.bincount(ones, minlength=2))
# -> dask.array<_bincount_agg-aggregate, shape=(), dtype=int64, chunksize=(), chunktype=numpy.ndarray>
```

It looks like this PR will fix that. So 🙌 for your work!
It's nice to hear that @rabernat - this has been an unexpectedly high-impact change. I only knew about the problems in scikit-image when I started looking at it.
The cupy bincount test passes locally now (I have cupy version 9.0.0b3 and numpy version 1.20.1).
I think this is ready to merge @dask/maintenance |
pentschev
left a comment
@GenevieveBuckley thanks for the latest changes, I added another suggestion on making this a bit safer, not sure if that's really possible to happen in Dask but I guess there's no harm in doing so.
jrbourbeau
left a comment
Thanks for your work on this @GenevieveBuckley!
Since axis is always (0,) we can streamline some of the logic around constructing chunks tuples (see the diff below). My hope is that will make reasoning about the logic here easier in the future.
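To see why the two forms agree, here is a small standalone check (with made-up example values, outside Dask) that the streamlined tuple expression matches the original generator-based one when `axis == (0,)`:

```python
# Example values standing in for a 1-D dask array's state:
output_size = (7,)    # e.g. minlength == 7
chunks = ((5, 5),)    # two chunks along axis 0
axis = (0,)

# Original form: rebuild every axis, substituting along `axis`.
original = tuple(
    output_size * len(c) if i in axis else c for i, c in enumerate(chunks)
)

# Streamlined form: axis is always (0,), so only the first entry changes.
streamlined = (output_size * len(chunks[0]), *chunks[1:])

assert original == streamlined == ((7, 7),)
```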
Diff:

```diff
diff --git a/dask/array/routines.py b/dask/array/routines.py
index 8d53541e..69f56b63 100644
--- a/dask/array/routines.py
+++ b/dask/array/routines.py
@@ -645,7 +645,6 @@ def bincount(x, weights=None, minlength=0, split_every=None):
         if weights.chunks != x.chunks:
             raise ValueError("Chunks of input array x and weights must match.")
-    axis = (0,)
     token = tokenize(x, weights, minlength)
     args = [x, "i"]
     if weights is not None:
@@ -655,32 +654,27 @@ def bincount(x, weights=None, minlength=0, split_every=None):
     meta = array_safe(np.bincount([]), x._meta)
     if minlength == 0:
-        output_size = np.nan
+        output_size = (np.nan,)
     else:
-        output_size = minlength
+        output_size = (minlength,)
     chunked_counts = blockwise(
         partial(np.bincount, minlength=minlength), "i", *args, token=token, meta=meta
     )
-    chunked_counts._chunks = tuple(
-        (output_size,) * len(c) if i in axis else c
-        for i, c in enumerate(chunked_counts.chunks)
-    )
+    chunked_counts._chunks = (output_size * len(chunked_counts.chunks[0]), *chunked_counts.chunks[1:])
     from .reductions import _tree_reduce
     output = _tree_reduce(
         chunked_counts,
         aggregate=partial(_bincount_agg, dtype=meta.dtype),
-        axis=axis,
+        axis=(0,),
         keepdims=True,
         dtype=meta.dtype,
         split_every=split_every,
         concatenate=False,
     )
-    output._chunks = tuple(
-        (output_size,) if i in axis else c for i, c in enumerate(chunked_counts.chunks)
-    )
+    output._chunks = (output_size, *chunked_counts.chunks[1:])
     output._meta = meta
     return output
```

Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Thank you for the feedback @pentschev and @jrbourbeau
pentschev
left a comment
LGTM, thanks @GenevieveBuckley ! 🙂
jrbourbeau
left a comment
Thanks for all your work on this @GenevieveBuckley and for reviewing @pentschev! This is in!
@pentschev could you clarify when NumPy>=1.20 is needed? I noticed we're only skipping …
@jrbourbeau I hope this helps explain it: If we use …
Great, thanks for that very clear explanation @GenevieveBuckley |
* Add VDS creation command-line tool
* Add Dask image binner command-line tool, with options for single-image, multi-image sweep and multi-image pump-probe binning.
* Add command line tool to inspect cue messages
* Avoid Dask v2021.03.0 due to a regression in that release — the `dask.array.bincount` function does not permit slicing in that version (see dask/dask#7391).
* Add more cue message info and tidy up the functions for discovering the timestamps of chosen cues
We've found problems trying to slice the output of `dask.array.bincount()` following #7183.

* Passes `black dask` / `flake8 dask`
* Tests added / passed:
  * a test in `dask/array/tests/test_routines.py` to check the output of `da.bincount` can be sliced
  * `pytest skimage/filters/tests/test_thresholding.py::test_thresholds_dask_compatibility -vv`
* Relevant issues: … bincount)

Summary:

When `dask.array.bincount` was modified in #7183 we lost the ability to slice the output returned by this function. The output `shape` and `chunks` attributes were empty tuples instead of what we expected. We want to keep the benefits of PR #7183 because it's a lot better at managing memory when given large datasets.

So to add back in the ability to slice the output from `bincount`, I've added some of the extra logic normally handled by `dask.array.reductions.reduction()` to keep track of the array shape & chunk sizes.
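As a rough illustration of the aggregation step — a simplified stand-in for Dask's internal `_bincount_agg`, not the actual implementation — each chunk produces its own partial bincount, and the partial results are padded to a common length and summed:

```python
import numpy as np

def bincount_agg_sketch(chunked_counts, minlength=0):
    """Combine per-chunk bincount results: pad each partial count
    to a common length, then sum them elementwise (a simplified
    stand-in for Dask's internal _bincount_agg)."""
    width = max(max(len(c) for c in chunked_counts), minlength)
    padded = [np.pad(c, (0, width - len(c))) for c in chunked_counts]
    return np.sum(padded, axis=0)

# Per-chunk counts for data split into two chunks:
a = np.bincount([0, 1, 1])   # -> [1, 2]
b = np.bincount([2, 2, 2])   # -> [0, 0, 3]
total = bincount_agg_sketch([a, b])
print(total)                 # [1 2 3] -- same as np.bincount over the full data
```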