[WIP] ENH Adds tree reduction to bincount #7183
Conversation
Huh. If there is no improvement, do you think it's worth including this? Or do you think there'd be an improvement with a different benchmark?
I think the benefit of this PR is memory usage for larger arrays:

```python
import dask.array as da
import dask
from dask.diagnostics import ResourceProfiler
import numpy as np

rng = np.random.RandomState(42)
x = rng.randint(50000, size=500_000_000)
x_da = da.from_array(x, chunks=100_000)
out = da.bincount(x_da)

with ResourceProfiler(dt=0.01) as rprof:
    out.compute()
rprof.visualize()
```

On this PR: stays at ~4.7 GB.
Edit: The dask profilers are awesome!
Oh that is a huge memory improvement! Thanks for taking the time to write that up :)
dask/array/routines.py
```python
    for i, _ in enumerate(x.__dask_keys__())
}
dtype = np.bincount([1], weights=[1]).dtype
meta = meta_from_array(weights)
```
I think the meta should be taken from a small sample call to `np.bincount`, the same way the dtype was computed before:
```diff
-    meta = meta_from_array(weights)
+    meta = np.bincount([1], weights=[1])
```
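For context on why deriving the meta from the weights array is wrong: `np.bincount` always returns `float64` whenever weights are supplied, even if the weights themselves are integers, so the meta must come from a sample `bincount` call rather than from the weights. A quick check in plain NumPy:

```python
import numpy as np

# Integer weights, matching the test case below
w = np.array([1, 2, 1, 0, 1], dtype=np.int32)

# With weights, np.bincount promotes the result to float64,
# regardless of the weights' own dtype
counts = np.bincount([2, 1, 5, 2, 1], weights=w)
print(counts.dtype)  # float64
print(w.dtype)       # int32
```

So `meta_from_array(weights)` would report `int32` here while the real output is `float64`, which is exactly the mismatch the suggested change avoids.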
Here's a test for this case:

```python
def test_bincount_with_int_weights():
    x = np.array([2, 1, 5, 2, 1])
    d = da.from_array(x, chunks=2)
    weights = np.array([1, 2, 1, 0, 1])
    dweights = da.from_array(weights, chunks=2)
    e = da.bincount(d, weights=dweights, minlength=6)
    assert_eq(e, np.bincount(x, weights=weights, minlength=6))
    assert same_keys(da.bincount(d, weights=dweights, minlength=6), e)
```

(The PR's test file also parametrizes the weights over `float32` and `int32` dtypes.)
This seems reasonable to me. Thanks for doing this work!
Thanks Thomas! 😀




- Closes da.bincount #4852
- Passes `black dask` / `flake8 dask`

This implementation is not faster with this benchmark:

[benchmark plot: This PR]
[benchmark plot: Master]

Maybe there is not a benefit to tree reduction here?
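For readers unfamiliar with the idea: a tree reduction computes a partial bincount per chunk and then combines the partials pairwise, so no single task ever has to hold every partial result at once, which is where the memory win above comes from. A minimal NumPy sketch of the combining pattern (hypothetical `tree_bincount` helper, not dask's actual implementation):

```python
import numpy as np

def tree_bincount(chunks, minlength=0):
    """Pairwise (tree) combination of per-chunk bincounts.

    A sketch of the reduction pattern only; dask builds this as a
    task graph instead of a Python loop.
    """
    # Partial counts, one per chunk
    parts = [np.bincount(c, minlength=minlength) for c in chunks]
    # Pad all partials to a common length so they can be added
    width = max(len(p) for p in parts)
    parts = [np.pad(p, (0, width - len(p))) for p in parts]
    # Combine pairwise until a single array remains
    while len(parts) > 1:
        parts = [
            parts[i] + parts[i + 1] if i + 1 < len(parts) else parts[i]
            for i in range(0, len(parts), 2)
        ]
    return parts[0]
```

The result matches a flat `np.bincount` over the concatenated chunks; the difference is only in how the partial sums are scheduled, which is why a small benchmark may show no speedup even though peak memory drops.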