Conversation
dask/core.py
Outdated
```diff
     set(['x'])

->>> get_dependencies(dsk, 'z')  # doctest: +SKIP
+>>> sorted(get_dependencies(dsk, 'z'))  # doctest: +SKIP
```
Can we lose the doctest: +SKIP?
I opted to remove the sorted() call instead.
```diff
     return ns['expand']


+def expander(where):
```
Why separate this into two functions?
Because the memoize decorator needs hashable arguments (which lists are not).
Try @memoize(key=tuple)
Also, should we be concerned about the memoized function holding on to too many inputs?
I don't think so, unless someone has a need for a large number of locations for newaxis in their indices.
> Try @memoize(key=tuple)

The signature for the key argument isn't very pretty (it takes an (args, kwargs) tuple). I wonder if it's not more readable to keep the two-argument variant.
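The decorator under discussion is presumably toolz's memoize, whose real `key` argument receives an `(args, kwargs)` tuple as noted above. A minimal toy sketch of the idea (simplified: this version passes only the positional args to the key function, and the names here are made up for illustration):

```python
from functools import wraps

def memoize(key=None):
    """Toy memoize decorator with an optional key function.
    (A simplification of the toolz.memoize idea; not its real code.)"""
    def deco(func):
        cache = {}
        @wraps(func)
        def wrapper(*args):
            # Lists are unhashable, so a key function can convert
            # them to tuples to make the result cacheable.
            k = key(args) if key is not None else args
            if k not in cache:
                cache[k] = func(*args)
            return cache[k]
        return wrapper
    return deco

calls = []

@memoize(key=lambda args: tuple(args[0]))
def expand(where):
    # `where` is a list, which is unhashable as-is
    calls.append(1)
    return sorted(set(where))

expand([3, 1, 2])
expand([3, 1, 2])   # second call is served from the cache
```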
A bit embarrassing: this PR makes https://nbviewer.jupyter.org/gist/shoyer/425cb3dc9101c235fe86 slower.
Force-pushed from 011ca23 to 86d3c19
I think I've fixed the regression. Also optimized
I've now found a way to make _deps() faster by rewriting it iteratively.
Why does creating the new list cause such a performance difference? What differences are you seeing?
The work list avoids both the recursion and the manual stack handling. The number of outer loop iterations is equal to the maximum task nesting level, which is usually very small.
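The work-list loop described above can be sketched as follows. This is a minimal illustration under a simplified task model (tuples headed by a callable are tasks, lists are walked, everything else is a potential key), not dask's actual code; the `passes` counter just makes the outer-iteration claim visible:

```python
def worklist_deps(dsk, task):
    """Return the keys of `dsk` referenced anywhere inside `task`,
    plus the number of outer passes taken."""
    deps = set()
    work = [task]
    passes = 0
    while work:
        passes += 1
        new_work = []
        for w in work:
            if type(w) is tuple and w and callable(w[0]):
                new_work += list(w[1:])   # unwrap one nesting level
            elif type(w) is list:
                new_work += w
            elif w in dsk:                # a key of the graph
                deps.add(w)
        work = new_work
    return deps, passes

add = lambda a, b: a + b
dsk = {'x': 1, 'y': 2}
# a task with three nesting levels:
task = (add, 'x', (add, 'y', (add, 'x', 'y')))
deps, passes = worklist_deps(dsk, task)
```

Each pass peels one level of nesting, so the pass count tracks the deepest nesting in the task rather than the total number of elements it contains.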
The difference I'm seeing: _deps() used to take 700 ms, and it now takes only 500 ms. As a whole, the cost of dependency computation is divided by 3 with this PR on the micro-benchmark above.
Sorry for the churn. I think this PR is now ready for review or merging. |
dask/core.py
Outdated
| """ | ||
| if not isinstance(tasks, list): | ||
| raise TypeError("Please provide a list of tasks") | ||
| return set(_deps(dsk, tasks)) |
Given the new simplified structure of get_dependencies is this function still useful?
Yes, because passing a bunch of tasks at once is more efficient than calling _deps() for each one (note that _deps() can be very fast on trivial entries).
dask/core.py
Outdated
```python
    if as_list:
        return {k: _deps(dsk, dsk[k]) for k in keys}
    else:
        return {k: set(_deps(dsk, dsk[k])) for k in keys}
```
Same question as above. Do we still need this function?
In the case above I now see that we're sending a list of tasks. In this case though we're just using a dict comprehension. That dict-comprehension could presumably be inlined. I suppose the gains here are based on the type checks and avoiding the extra function call?
Yes, they are. We could further inline the dict comprehension, but at the price of code duplication, so I'm not sure that it's a good idea.
If it's a single line dict comprehension then I would prefer the code duplication. My opinion is that code indirection (having to follow many different functions to find the logic you want) can be as bad as code duplication in some cases.
_deps() is a private API; should we make it public? Or use get_dependencies instead (at the price of a small slowdown: cull() seems to become 15% slower)?
Thoughts on removing _deps entirely and folding it into get_dependencies? We could place the burden of arg/task inputs and list/set outputs on the caller.
This can also happen later. Mostly I want to push on the principle of "there is a maintenance cost to having lots of functions that do similar things". If we can have a few of these that compose well with standard language constructs like dictionary comprehensions then I think that dask.core will remain an easy place to get started for newish developers.
Definitely agreed. I'll try to fold _deps into get_dependencies.
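As a hedged sketch of what the fold might look like (an illustration of the API shape being discussed, not dask's actual implementation): a single get_dependencies that accepts either a key or a raw task and leaves the list-vs-set choice to the caller, so the dict-building helpers reduce to plain dict comprehensions at the call site:

```python
def get_dependencies(dsk, key=None, task=None, as_list=False):
    """Sketch: find keys referenced by dsk[key], or by `task` directly."""
    if task is None:
        task = dsk[key]
    deps = []
    work = [task]
    while work:                      # iterative walk, as in the PR
        new_work = []
        for w in work:
            typ = type(w)
            if typ is tuple and w and callable(w[0]):
                new_work += w[1:]
            elif typ is list:
                new_work += w
            elif typ is dict:
                new_work += list(w.values())
            elif w in dsk:
                deps.append(w)
        work = new_work
    return deps if as_list else set(deps)

inc = lambda x: x + 1
dsk = {'a': 1, 'b': (inc, 'a'), 'c': (inc, 'a', 'b')}
# the helper becomes a plain dict comprehension at the call site:
deps = {k: get_dependencies(dsk, k) for k in dsk}
```

The caller pays for `set(...)` or `as_list` explicitly, which is the "burden on the caller" trade-off mentioned above.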
@eriknw some of the performance optimizations to dask.core and dask.optimize might interest you.
Hmm, I'm not sure why 831be05 produced the following failure on AppVeyor. Could it be an unrelated intermittent failure? One possible explanation is that
Yes, I suspect that this is caused by empty. We should probably replace it with random.random.
Thanks for the ping, I'll take a look when convenient.
Two thoughts:
For distributed, see dask/distributed#616.
OK |
This seems fine to me. I recommend that we wait a bit for dask/distributed#616 to get in, and in case @eriknw has some time for review. Otherwise I plan to merge tomorrow morning (around 18 hours from now).
.travis.yml
Outdated
```yaml
    - pip install git+https://github.com/mrocklin/partd --upgrade
    - pip install git+https://github.com/mrocklin/cachey --upgrade
    - pip install git+https://github.com/dask/zict --upgrade
    - pip install git+https://github.com/dask/distributed --upgrade
```
I'm trying to fix the following failure:
https://travis-ci.org/dask/dask/jobs/172618031#L731
I'm not sure why the build worked previously.
Now I'm getting another similar issue with s3fs: https://travis-ci.org/dask/dask/jobs/172623691#L751
Do you have an idea what may be happening?
s3fs is included as a dependency in the distributed conda package but not in the PyPI package. We keep the PyPI packages lightweight.
```python
                new_work += w[1:]
            elif typ is list:
                new_work += w
            elif typ is dict:
```
It's not added; it's part of the _deps() algorithm, rewritten iteratively and folded in here.
I see, thanks. I'm confused why we want to do this with dicts. This was added in 05ddd68.
```python
In [20]: d = {'x': 1, 'y': {'a': 'x'}}

In [21]: dask.core.get_dependencies(d, 'y')
Out[21]: {'x'}

In [22]: dask.core.get(d, ['y'])
Out[22]: ({'a': 'x'},)

In [23]: dask.get(d, ['y'])
---------------------------------------------------------------------------
KeyError: 'y'
```
@eriknw we do this for the distributed scheduler. It ended up being useful, though at the moment I don't particularly recall the original use case.
```python
In [1]: from distributed import Client

In [2]: c = Client()

In [3]: d = {'x': 1, 'y': {'a': 'x'}}

In [4]: c.get(d, 'y')
Out[4]: {'a': 1}
```
I agree the dict solution doesn't look pretty.
Here's a simple, non-recursive helper function that makes the intent explicit and doesn't require changing the spec:
```python
In [35]: def subdask(d):
    ...:     return (dict, (zip, tuple(d.keys()), list(d.values())))
    ...:
    ...: d = {
    ...:     'a': 1,
    ...:     'b': 2,
    ...:     'c': subdask({'a': 'a', 'b': 'b'}),
    ...: }
    ...:

In [36]: d
Out[36]: {'a': 1, 'b': 2, 'c': (dict, (<function zip>, ('a', 'b'), ['a', 'b']))}

In [37]: dask.core.get(d, 'c')
Out[37]: {'a': 1, 'b': 2}
```
By the way, I'm assuming this discussion isn't an issue for the current PR, right? :-)
Yeah. The PR lgtm. It keeps the existing behavior with dicts, which we can revisit elsewhere.
Merging in 12 hours if no comment.
Despite the PR title, this PR also brings more generic optimizations to dependency computations, as well as various other optimizations.