Faster array slicing #1731

Merged
mrocklin merged 17 commits into dask:master from pitrou:faster_slicing on Nov 3, 2016

Conversation

@pitrou (Member) commented Oct 31, 2016

Despite the PR title, this also brings more generic optimizations to dependency computations, along with various other improvements.

dask/core.py Outdated
set(['x'])

>>> get_dependencies(dsk, 'z') # doctest: +SKIP
>>> sorted(get_dependencies(dsk, 'z')) # doctest: +SKIP
Member:
Can we lose the doctest: +SKIP?

Member Author:

I opted to remove the sorted() call instead.

return ns['expand']


def expander(where):
Member:

Why separate this into two functions?

Member Author:

Because the memoize decorator needs hashable arguments (which lists are not).

Member:

Try @memoize(key=tuple)

Also, should we be concerned about the memoized function holding on to too many inputs?

Member Author:

I don't think so, unless someone has a need for a large number of locations for newaxis in their indices.

Member Author:

> Try @memoize(key=tuple)

The signature for the key argument isn't very pretty (it takes an (args, kwargs) tuple). I wonder if it's not more readable to keep the two-argument variant.
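The two-function split and the key=tuple idea can be sketched with functools.lru_cache standing in for dask's memoize decorator; the helper names here are hypothetical illustrations, not dask's actual functions. The outer wrapper converts the unhashable list to a tuple so the cached inner function can hash it:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def _make_inserter(where):
    # `where` is a hashable tuple of positions, so lru_cache can key on it.
    def expand(seq):
        out = list(seq)
        for i in sorted(where):
            out.insert(i, None)  # None stands in for np.newaxis
        return tuple(out)
    return expand


def make_inserter(where):
    # Public wrapper: accepts any iterable, including an unhashable list.
    return _make_inserter(tuple(where))


make_inserter([1])(('a', 'b'))  # → ('a', None, 'b')
```

Because the cache keys on the normalized tuple, repeated calls with equal position lists return the very same cached function object.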

@pitrou pitrou changed the title [WIP] Faster array slicing Faster array slicing Oct 31, 2016
@pitrou (Member Author) commented Oct 31, 2016

A bit embarrassing: this PR makes https://nbviewer.jupyter.org/gist/shoyer/425cb3dc9101c235fe86 slower.
I'll have to take a deeper look.

@pitrou (Member Author) commented Nov 1, 2016

I think I've fixed the regression. Also optimized inline_functions by using a recursive solution.

@pitrou (Member Author) commented Nov 1, 2016

I've now found a way to make _deps() faster using an iterative approach.

@mrocklin (Member) commented Nov 1, 2016

Why does creating the new list cause such a performance difference? What differences are you seeing?

@pitrou (Member Author) commented Nov 1, 2016

The work list avoids both the recursion and the manual stack handling. The number of outer loop iterations is equal to the maximum task nesting level, which is usually very small.
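A minimal sketch of that work-list shape (an illustration of the idea, not dask's actual _deps code): each pass over the work list flattens one nesting level, so the outer loop runs only as many times as the deepest task nesting.

```python
def walk_deps(dsk, task):
    # Collect keys referenced anywhere inside `task`, flattening one
    # nesting level per outer-loop pass.
    deps = set()
    work = [task]
    while work:
        new_work = []
        for w in work:
            typ = type(w)
            if typ is tuple and w and callable(w[0]):
                new_work += list(w[1:])   # task tuple: skip the callable
            elif typ is list:
                new_work += w
            elif typ is dict:
                new_work += list(w.values())
            else:
                try:
                    if w in dsk:
                        deps.add(w)       # a reference to another key
                except TypeError:
                    pass                  # unhashable literal
        work = new_work
    return deps


dsk = {'a': 1, 'b': 2, 'c': (sum, [['a', 'b'], 3])}
walk_deps(dsk, dsk['c'])  # → {'a', 'b'}
```

For dsk['c'] the nesting is tuple → list → list → keys, so the loop runs four passes, independent of how many keys each level contains.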

@pitrou (Member Author) commented Nov 1, 2016

The difference I'm seeing: on

N = 1000
x = da.ones((N, N), chunks=(10, 10), name='x')
z = [x[i] for i in range(N)]

_deps() used to take 700 ms; it now takes only 500 ms.

As a whole, the cost of dependency computation is divided by 3 with this PR on the micro-benchmark above.

@pitrou (Member Author) commented Nov 1, 2016

Sorry for the churn. I think this PR is now ready for review or merging.

dask/core.py Outdated
"""
if not isinstance(tasks, list):
raise TypeError("Please provide a list of tasks")
return set(_deps(dsk, tasks))
Member:

Given the new simplified structure of get_dependencies, is this function still useful?

Member Author:

Yes, because passing a bunch of tasks at once is more efficient than calling _deps() for each one.
(Note that _deps() can be very fast on trivial entries.)

dask/core.py Outdated
if as_list:
return {k: _deps(dsk, dsk[k]) for k in keys}
else:
return {k: set(_deps(dsk, dsk[k])) for k in keys}
Member:

Same question as above. Do we still need this function?

Member Author:

Ditto.

Member:

In the case above I now see that we're sending a list of tasks. In this case though we're just using a dict comprehension. That dict-comprehension could presumably be inlined. I suppose the gains here are based on the type checks and avoiding the extra function call?

Member Author:

Yes, they are. We could further inline the dict comprehension, but at the price of code duplication, so I'm not sure that it's a good idea.

Member:

If it's a single line dict comprehension then I would prefer the code duplication. My opinion is that code indirection (having to follow many different functions to find the logic you want) can be as bad as code duplication in some cases.

Member Author:

Ah... I'll take a look.

Member Author:

_deps() is a private API, should we make it public? Use get_dependencies instead (at the price of a small slowdown - cull() seems to become 15% slower)?

Member:

Thoughts on removing _deps entirely and folding it into get_dependencies? We could place the burden of arg/task inputs and list/set outputs on the caller.

Member:

This can also happen later. Mostly I want to push on the principle of "there is a maintenance cost to having lots of functions that do similar things". If we can have a few of these that compose well with standard language constructs like dictionary comprehensions then I think that dask.core will remain an easy place to get started for newish developers.

Member Author:

Definitely agreed. I'll try to fold _deps into get_dependencies.

@mrocklin (Member) commented Nov 1, 2016

@eriknw some of the performance optimizations to dask.core and dask.optimize might interest you.

@pitrou (Member Author) commented Nov 1, 2016

Hmm, I'm not sure why 831be05 produced the following failure on AppVeyor:
https://ci.appveyor.com/project/dask-ci/dask/build/1.0.222#L490

Could it be an unrelated intermittent failure? One possible explanation is that empty() returned some NaNs or Infs and allclose() failed.

@mrocklin (Member) commented Nov 1, 2016

Yes, I suspect that this is caused by empty. We should probably replace it with random.random.
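The suspected failure mode is easy to reproduce in plain Python: NaN never compares equal, so data that happens to contain NaN (as uninitialized empty() memory can) fails any allclose-style comparison, even against itself.

```python
import math

x = float('nan')
print(x == x)              # False: NaN is not equal to anything, itself included
print(math.isclose(x, x))  # False: closeness checks fail on NaN too
```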

@eriknw (Member) commented Nov 1, 2016

Thanks for the ping, I'll take a look when convenient.

@mrocklin (Member) commented Nov 1, 2016

Two thoughts:

  1. _deps is also used a couple of times within dask/distributed
  2. Thoughts on removing the as_list= keyword parameter and instead having the caller do this explicitly, as in set(get_dependencies(dsk, task))?

@pitrou (Member Author) commented Nov 1, 2016

For distributed, see dask/distributed#616.
as_list feels ok to me right now, but we can revisit it later. Returning a set by default makes more sense because the result is naturally unordered, and in most cases you don't care about duplicates.
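The duplicates point can be made concrete with a toy helper (a hypothetical name, not dask's API): a task that references the same key twice yields duplicates as a list, while a set deduplicates them.

```python
def arg_keys(dsk, task):
    # Keys referenced directly by a task tuple's arguments.
    return [a for a in task[1:] if isinstance(a, str) and a in dsk]


dsk = {'x': 1, 'y': (lambda a, b: a + b, 'x', 'x')}
print(arg_keys(dsk, dsk['y']))       # ['x', 'x']
print(set(arg_keys(dsk, dsk['y'])))  # {'x'}
```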

@mrocklin (Member) commented Nov 1, 2016

OK

@mrocklin (Member) commented Nov 1, 2016

This seems fine to me. I recommend that we wait a bit for dask/distributed#616 to get in and in case @eriknw has some time for review. Otherwise I plan to merge tomorrow morning (around 18 hours from now)

@pitrou pitrou added the array label Nov 1, 2016
.travis.yml Outdated
- pip install git+https://github.com/mrocklin/partd --upgrade
- pip install git+https://github.com/mrocklin/cachey --upgrade
- pip install git+https://github.com/dask/zict --upgrade
- pip install git+https://github.com/dask/distributed --upgrade
Member:

Why are these necessary?

Member Author:

I'm trying to fix the following failure:
https://travis-ci.org/dask/dask/jobs/172618031#L731
I'm not sure why the build worked previously.

Member Author:

Now I'm getting another similar issue with s3fs: https://travis-ci.org/dask/dask/jobs/172623691#L751
Do you have an idea what may be happening?

Member:

s3fs is included as a dependency in the distributed conda package but not in the PyPI package. We keep the PyPI packages lightweight.

new_work += w[1:]
elif typ is list:
new_work += w
elif typ is dict:
Member:

Why was this added?

Member Author:

It's not added; it's part of the _deps() algorithm, rewritten iteratively and folded in here.

Member:

I see, thanks. I'm confused why we want to do this with dicts. This was added in 05ddd68.

In [20]: d = {'x': 1, 'y': {'a': 'x'}}

In [21]: dask.core.get_dependencies(d, 'y')
Out[21]: {'x'}

In [22]: dask.core.get(d, ['y'])
Out[22]: ({'a': 'x'},)

In [23]: dask.get(d, ['y'])
---------------------------------------------------------------------------
KeyError: 'y'

Member Author:

No idea from me :-)

Member:

@eriknw we do this for the distributed scheduler. It ended up being useful, though at the moment I don't particularly recall the original use case.

In [1]: from distributed import Client
In [2]: c = Client()
In [3]: d = {'x': 1, 'y': {'a': 'x'}}
In [4]: c.get(d, 'y')
Out[4]: {'a': 1}

Member Author:

I agree the dict solution doesn't look pretty.

Member:

Here's a simple, non-recursive helper function that makes the intent explicit and doesn't require changing the spec:

In [35]: def subdask(d):
    ...:     return (dict, (zip, tuple(d.keys()), list(d.values())))
    ...: 
    ...: d = {
    ...:     'a': 1,
    ...:     'b': 2,
    ...:     'c': subdask({'a': 'a', 'b': 'b'}),
    ...: }
    ...: 

In [36]: d
Out[36]: {'a': 1, 'b': 2, 'c': (dict, (<function zip>, ('a', 'b'), ['a', 'b']))}

In [37]: dask.core.get(d, 'c')
Out[37]: {'a': 1, 'b': 2}

Member Author:

By the way, I'm assuming this discussion isn't an issue for the current PR, right? :-)

Member:

Yeah. The PR lgtm. It keeps the existing behavior with dicts, which we can revisit elsewhere.

@mrocklin (Member) commented Nov 2, 2016

Merging in 12 hours if no comment

@mrocklin mrocklin merged commit 1b18829 into dask:master Nov 3, 2016
@sinhrks sinhrks added this to the 0.12.0 milestone Nov 7, 2016