
Tree reductions for dask.array #839

Merged
jcrist merged 9 commits into dask:master from jcrist:tree_red on Dec 3, 2015

Conversation

@jcrist
Member

@jcrist jcrist commented Nov 19, 2015

This enables support for tree reductions, which should improve efficiency of using dask.array across multiple processes/machines, or when arrays are composed of a large number of chunks.

The idea is to set a maximum number of chunks to be gathered and combined (either overall, or by axis) when performing reductions - breaking the reduce step of map-reduce into a tree of smaller reductions.
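The map-reduce split described here can be sketched in plain Python (a toy illustration of the idea, not dask's graph-based implementation):

```python
from functools import reduce

def tree_reduce(chunks, combine, max_leaves):
    """Combine at most `max_leaves` partial results per step,
    repeating level by level until a single result remains."""
    results = list(chunks)
    while len(results) > 1:
        # Group the current level into batches of up to max_leaves,
        # then reduce each batch into one partial result.
        results = [
            reduce(combine, results[i:i + max_leaves])
            for i in range(0, len(results), max_leaves)
        ]
    return results[0]

tree_reduce(range(10), lambda a, b: a + b, 4)  # same answer as sum(range(10))
```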

At an API level, reductions expose a max_leaves kwarg, which defaults to the current behavior. It accepts either a dict of {axis: max_chunks}, or an integer, which is used to compute max_chunks for each dimension such that the total number of chunks gathered in each reduction is approximately max_leaves. For example, axis=0, max_leaves=16 -> max_leaves={0: 16}; axis=(0, 1), max_leaves=16 -> max_leaves={0: 4, 1: 4}.
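The integer-to-dict normalization described above could look roughly like this (a sketch; the exact rounding rule is an assumption on my part, but it reproduces the examples in the text):

```python
def normalize_max_leaves(max_leaves, axis):
    """Spread an integer max_leaves across the reduction axes so the
    product of the per-axis limits is approximately max_leaves."""
    if isinstance(max_leaves, dict):
        return max_leaves
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    # Take the n-th root so each reduction axis gets an equal share.
    per_axis = max(1, int(round(max_leaves ** (1.0 / len(axes)))))
    return {a: per_axis for a in axes}

normalize_max_leaves(16, (0, 1))  # -> {0: 4, 1: 4}
normalize_max_leaves(16, 0)       # -> {0: 16}
```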

Example:

import dask.array as da
import numpy as np

x = np.arange(1, 122).reshape((11, 11)).astype('f4')
a = da.from_array(x, chunks=(4, 4))
o = a.sum(axis=0, max_leaves=4)

(task graph visualization)

What would normally be a single-step reduction has been broken into a tree of depth 2.

Todo:

  • Add support for moment (and thus std, var, etc...)
  • Add support for arg* reductions
  • More tests

@shoyer
Member

shoyer commented Nov 19, 2015

Does using a tree merge cause performance degradation for the single node case? If not, we might want to default to that behavior.

@jcrist
Member Author

jcrist commented Nov 19, 2015

For a wide enough tree, no. It actually might improve performance, as the aggregate step can be done in parallel (probably negligible though). The trick is determining a sane default number.

@mrocklin
Member

These produce different results

In [11]: x.sum(max_leaves=4).visualize('dask.pdf')

In [12]: x.sum(max_leaves={0: 4}).visualize('dask.pdf')

@mrocklin
Member

I wouldn't expect much degradation in non-pathological cases. The worst thing we're doing is adding O(log_k n) more tasks.

Suggest renaming max_leaves. I think your intent with the name was something akin to max_children. There might even be a slicker name than that. I'm trying to think of a term like breadth that might be the converse of the depth of a tree. Perhaps "branching factor", if it weren't so verbose.
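As a rough check on the O(log_k n) claim, a small pure-Python sketch (the counting convention here is illustrative, not dask's actual bookkeeping):

```python
import math

def tree_stats(n_chunks, k):
    """Estimate depth and extra task count of a k-ary tree
    reduction over n_chunks leaves."""
    depth = max(1, math.ceil(math.log(n_chunks, k)))
    # Count reduction tasks level by level: each level combines
    # groups of up to k partial results until one remains.
    tasks = 0
    n = n_chunks
    while n > 1:
        n = math.ceil(n / k)
        tasks += n
    return depth, tasks

tree_stats(100, 4)  # -> (4, 35): 100 -> 25 -> 7 -> 2 -> 1
```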

Member

If you feel particularly underburdened we could move the partial call here into atop, allowing it to support keyword args.

Member Author

Are you saying that atop should handle the partial itself?

Member

Perhaps

@mrocklin
Member

Any thoughts on how hard the combine function for moments will be?

@mrocklin
Member

I haven't worked through the gritty details, but at first glance this looks pretty cool to me.

@jcrist
Member Author

jcrist commented Nov 20, 2015

These produce different results

What is the shape of x in that case? If it's not 1d, then I'd expect them to. I definitely need to clean up input verification though, as it occurs to me that I never verify that sorted(axis) == sorted(max_leaves). Need max block combination sizes for all reduction axes.

@mrocklin
Member

What is the shape of x in that case? If it's not 1d, then I'd expect them to. I definitely need to clean up input verification though, as it occurs to me that I never verify that sorted(axis) == sorted(max_leaves). Need max block combination sizes for all reduction axes.

x = da.arange(1000, chunks=100)

There is some input validation missing. This is also the motivation behind the "we should test the branching factor explicitly somewhere" comment

- Input cleanup
- More tests
- Fix integer `max_leaves` input bug
@jcrist
Member Author

jcrist commented Nov 20, 2015

I cleaned up the tests, and fixed the bug you pointed out above. Could always test more, but I think the coverage here is pretty good.

Any thoughts on how hard the combine function for moments will be?

It's proving harder than I would like - my tired brain isn't up for mathing it out right now :/. Should be doable though, just need to do some math.

Member

You could use dask.core.get_deps to get the dependencies dict and then use this dictionary in more sophisticated tests like the following:

dependencies, dependents = get_deps(x.sum(max_leaves={0: 2, 1: 3}).dask)
assert max(map(len, dependencies.values())) == 2 * 3

dependencies, dependents = get_deps(x.sum(axis=1, max_leaves={0: 2, 1: 3}).dask)
assert max(map(len, dependencies.values())) == 3

There are presumably several such interesting configurations that would be useful to verify now, rather than several months from now when someone screws with this code unknowingly.

Member

Ping

Member Author

Yeah, I saw that. Will fix.

@mrocklin
Member

@jcrist what is the status on this? I wouldn't mind using it to redo recent array experiments.

@jcrist
Member Author

jcrist commented Nov 24, 2015

Doesn't support arg_*, var, std, or moment, and could use some more tests. I'm mostly out this week, so I doubt this will be fixed until next week. Should be good to go for redoing the array experiments though, if you don't mind working off non-merged stuff.

@mrocklin
Member

I'm not sure if these should have the same name or not.

In [1]: import dask.array as da

In [2]: x = da.random.random((100, 100), chunks=(10, 10))

In [3]: x.mean(axis=0).name
Out[3]: 'reduce-460ea5b3313d36450410978acc2ace95'

In [4]: x.mean(axis=0, max_leaves=6).name
Out[4]: 'reduce-460ea5b3313d36450410978acc2ace95'

@mrocklin
Member

Using this branch I've found a speedup from 30s to 5s when doing a reduction of 2GB across a small network of 3 machines. This was only when the reduction required data sharing, and only because we happened to have scattered the data in a way that aligns with this (both defaults cause nice behavior). There was no speedup or slowdown when the reduction didn't require significant data transfer, though that case was fast anyway (1-2s).

@mrocklin
Member

Reasons to keep the keys the same:

  • They're mathematically the same result, just computed in different ways

Reasons to keep the keys separate and include max_leaves in tokenize:

  • The graphs are quite different, and so assumptions made about same keys having the same dependencies are incorrect.
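dask derives key names with its tokenize function; as a stand-alone illustration (a hashlib stand-in, not dask's actual code), including the branching factor among the hashed inputs is what disambiguates the keys:

```python
import hashlib

def reduction_key(prefix, *args):
    """Build a deterministic key from a reduction's parameters.
    Hashing max_leaves along with the other inputs gives
    structurally different graphs different key names."""
    token = hashlib.md5(repr(args).encode()).hexdigest()
    return "%s-%s" % (prefix, token)

k1 = reduction_key("reduce", "input-name", 0, None)  # default reduction
k2 = reduction_key("reduce", "input-name", 0, 6)     # max_leaves=6
# k1 != k2: the graphs differ, so the keys differ too
```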

@jcrist
Member Author

jcrist commented Dec 1, 2015

I made them have the same keys so that it would play well with caching. I could revert this, but it seemed like the correct behavior in my mind. Where do we make assumptions that identical key names have the same dependencies?

@mrocklin
Member

mrocklin commented Dec 1, 2015

Within the distributed scheduler.

Or, more generally, within any scheduler that supports graph updates.

@jcrist
Member Author

jcrist commented Dec 1, 2015

Ah, that makes sense. I don't have strong opinions on this, happy to go either way.

@mrocklin
Member

mrocklin commented Dec 1, 2015

My life will be easier if we include the branching factor within tokenize. It also follows the "when in doubt, disambiguate keys" principle.

@mrocklin
Member

mrocklin commented Dec 1, 2015

I'm still in favor of changing the name away from max_leaves. To me a leaf is strictly at the bottom of a tree. This is a restriction both on leaves and on interior nodes.

@jcrist
Member Author

jcrist commented Dec 1, 2015

Better name suggestion? Some ideas:

  • max_children
  • max_branches
  • branches
  • branch_factor
  • bfactor

@mrocklin
Member

mrocklin commented Dec 1, 2015

I like max_children and branch_factor

@mrocklin
Member

mrocklin commented Dec 1, 2015

Heuristics sound like they would be valuable. I think that we'll be better able to do this after some use. I could use @jcrist's help on some other things, so I'm still behind a default that is just-a-number for now.

What are your thoughts on the kwarg names above?

@shoyer
Member

shoyer commented Dec 1, 2015

I think max_children is better than branch_factor, which is more vague. Still, it feels evocative of the particular implementation (using trees) rather than the concept (splitting the reduction into sub-problems).

More options:

  • split_threshold
  • max_chunks
  • max_splits
  • reduce_limit
  • subproblem_size
  • max_subproblems

I think split_threshold is my favorite.

Previously `a.sum().name == a.sum(split_threshold=2).name`. This has
been removed, as distributed assumes keys with the same name have the
same dependencies.
@jcrist
Member Author

jcrist commented Dec 3, 2015

This could use another review. I changed the keyword to split_threshold, added it to set_options, and defaulted at 32. Tests were improved, and all reductions are now supported.

Member

This is a reassuring test. Thanks!

@mrocklin mrocklin changed the title [WIP] Tree reductions for dask.array Tree reductions for dask.array Dec 3, 2015
@mrocklin
Member

mrocklin commented Dec 3, 2015

This looks great to me.

@mrocklin
Member

mrocklin commented Dec 3, 2015

+1

jcrist added a commit that referenced this pull request Dec 3, 2015
Tree reductions for dask.array
@jcrist jcrist merged commit 2475078 into dask:master Dec 3, 2015
@jcrist jcrist deleted the tree_red branch December 3, 2015 22:26
@jcrist
Member Author

jcrist commented Dec 3, 2015

I expect this to need some use before we can determine a good default behavior/smart heuristic. I'm going to experiment with the ocean dataset to see how different configurations fare. If others have some workflows they could try this on, it would be much appreciated.

@mrocklin
Member

mrocklin commented Dec 4, 2015

I was playing with this with distributed with @stefanv . Some issues came up

  1. The split_threshold kwarg wasn't immediately clear to him (perhaps he can chime in here with his thoughts).
  2. We ran into an issue about reductions on flexible types while computing a mean. This might be a distributed or a dask.array issue. Some play testing is in order. Flexible type issues arise when you try to call a typical reduction on an array of struct dtype.

@jcrist
Member Author

jcrist commented Dec 4, 2015

There are no docs for this yet, so any suggestions on how to make this more intuitive to use would be much appreciated.

We ran into an issue about reductions on flexible types while computing a mean.

I assume you weren't trying to compute a reduction on a struct array (which doesn't work even in numpy), but instead hit a bug somewhere in dask/distributed that resulted in the reduction being computed on a struct array? If you remember what computation caused it, it would be nice to be able to reproduce.

@stefanv
Contributor

stefanv commented Dec 4, 2015

The reason split_threshold did not immediately make sense to me is because it contains these two concepts I needed to think about: split and threshold. And threshold is slightly confusing because it could imply "split after this condition" or "split until this condition", with potentially different meanings ("split after n chunks" or "split until you have n chunks").

So, while I don't propose this name, something like split_every_n_chunks would read very straightforwardly. Maybe there's a shorter version of that, like chunks_per_split, split_every, group_chunks, etc.

@shoyer
Member

shoyer commented Dec 4, 2015

I like Stefan's names!


@jcrist
Member Author

jcrist commented Dec 11, 2015

Sorry, this somehow slipped through. Of the new names, I like split_every best when used in a function call (i.e. a.mean(axis=1, split_every=32)). As a global/context config with set_options, I like split_every_n_chunks best, as it's more descriptive. Unsure what's best here. Either way, a better name would be nice.

@mrocklin
Member

I'd prefer using the same name in both places if possible.


@jcrist
Member Author

jcrist commented Dec 11, 2015

Yeah, I wasn't suggesting different names in different spots. Perhaps just go with the more verbose one, as it will probably be set contextually/globally?

@mrocklin
Member

I'd prefer split_every over split_every_n_chunks

@shoyer
Member

shoyer commented Dec 11, 2015

I think split_every is probably verbose enough. I suspect it would usually be poor practice to set this globally, given that it depends on the particular reduction being performed.


@jcrist
Member Author

jcrist commented Dec 11, 2015

Done. See #876.
