
[WIP] ENH Adds tree reduction to bincount #7183

Merged
jsignell merged 5 commits into dask:master from thomasjpfan:bincount_tree_reduction_rb
Feb 24, 2021

Conversation

@thomasjpfan
Contributor

This implementation is not faster with this benchmark:

import dask.array as da
import numpy as np

rng = np.random.RandomState(42)
x = rng.randint(5000, size=10_000_000)
x_da = da.from_array(x, chunks=300_000)
out = da.bincount(x_da)
%%timeit
_ = out.compute()

This PR

26.1 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Master

21.2 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Maybe there is not a benefit to tree reduction here?
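For readers following along: the tree reduction under discussion computes one bincount per chunk and then sums the partial counts a few at a time, rather than summing all of them in a single task. A minimal NumPy sketch of the idea (the function name and `split_every` default here are illustrative, not dask's actual implementation):

```python
import numpy as np

def bincount_tree(chunks, minlength=0, split_every=2):
    """Illustrative sketch of a tree-reduced bincount over a list of chunks."""
    # Leaf step: one bincount per chunk, padded to a common length.
    n = max([minlength] + [int(c.max()) + 1 for c in chunks])
    parts = [np.bincount(c, minlength=n) for c in chunks]
    # Combine step: sum `split_every` partial counts at a time,
    # repeating until a single array of counts remains.
    while len(parts) > 1:
        parts = [
            np.sum(parts[i : i + split_every], axis=0)
            for i in range(0, len(parts), split_every)
        ]
    return parts[0]
```

Compared with summing all partial counts in one task, this bounds how many intermediate count arrays are held at once, which is where a memory benefit (if any) would come from.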

@jsignell
Member

Huh. If there is no improvement do you think it's worth including this? Or do you think there'd be an improvement with a different benchmark?

@thomasjpfan
Contributor Author

thomasjpfan commented Feb 23, 2021

I think the benefit of this PR is reduced memory usage for larger arrays:

import dask.array as da
import dask
from dask.diagnostics import ResourceProfiler
import numpy as np

rng = np.random.RandomState(42)
x = rng.randint(50000, size=500_000_000)
x_da = da.from_array(x, chunks=100_000)
out = da.bincount(x_da)

with ResourceProfiler(dt=0.01) as rprof:
    out.compute()
    
rprof.visualize()

On master: goes to ~7.4 GB

(ResourceProfiler plot: master)

This PR: stays at ~4.7 GB

(ResourceProfiler plot: this PR)

master is always a little faster, but it uses more memory. On my machine, CPU utilization is not as good with my implementation; I have a feeling it is how I am using blockwise. I think going back to constructing the HighLevelGraph by hand may be better.

Edit: The dask profilers are awesome!

@thomasjpfan thomasjpfan changed the title ENH Adds tree reduction to bincount [WIP] ENH Adds tree reduction to bincount Feb 23, 2021
@jsignell
Member

Oh that is a huge memory improvement! Thanks for taking the time to write that up :)

for i, _ in enumerate(x.__dask_keys__())
}
dtype = np.bincount([1], weights=[1]).dtype
meta = meta_from_array(weights)
Member


I think the meta should be taken from some small version of the bincount like how dtype was before:

Suggested change
meta = meta_from_array(weights)
meta = np.bincount([1], weights=[1])

Here's a test for this case:

def test_bincount_with_int_weights():
    x = np.array([2, 1, 5, 2, 1])
    d = da.from_array(x, chunks=2)
    weights = np.array([1, 2, 1, 0, 1])

    dweights = da.from_array(weights, chunks=2)
    e = da.bincount(d, weights=dweights, minlength=6)
    assert_eq(e, np.bincount(x, weights=dweights.compute(), minlength=6))
    assert same_keys(da.bincount(d, weights=dweights, minlength=6), e)
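The suggested change above matters because `np.bincount` always returns float64 whenever `weights` are given, even when the weights are integers, so a `meta` derived from the weights array would advertise the wrong output dtype. A quick check of that NumPy behavior:

```python
import numpy as np

# np.bincount with any weights returns float64, regardless of the weights' dtype,
# so taking meta from the weights array (int32 here) would be wrong.
int_weights = np.array([1, 2, 1, 0, 1], dtype=np.int32)
counts = np.bincount(np.array([2, 1, 5, 2, 1]), weights=int_weights, minlength=6)
print(counts.dtype)  # float64, not int32
```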

np.array([1, 2, 1, 0.5, 1], dtype=np.float32),
np.array([1, 2, 1, 0, 1], dtype=np.int32),
],
)
Member


👏

@thomasjpfan
Contributor Author

Same benchmark with Profiler included:

On master

(Profiler plot: master)

This PR

(Profiler plot: this PR)

My guess is that there is more GIL contention when running `_bincount_agg` and `np.bincount` together.

@jsignell
Member

This seems reasonable to me. Thanks for doing this work!

@jsignell jsignell merged commit bf1c65c into dask:master Feb 24, 2021
@jakirkham
Member

Thanks Thomas! 😀



Development

Successfully merging this pull request may close these issues.

Use tree reduction in da.bincount

3 participants