Reimplement argtopk to release the GIL#3610
Conversation
| @@ -1,248 +1,295 @@ | |||
| """ A set of NumPy functions to apply per chunk """ | |||
There was a problem hiding this comment.
Looks like a github bug getting confused by DOS/Unix line terminations? The commit looks right both in PyCharm and in Gitkraken...
|
@piercefreeman @mrocklin ready for review and merge |
mrocklin
left a comment
There was a problem hiding this comment.
A few minor comments, mostly stylistic.
@crusaderky you may be over-estimating the intelligence of reviewers. If you have time to say what you did and how you solved the problem that would help. It takes me a surprisingly long time to get into how someone else solves a problem only by looking at their code.
dask/array/chunk.py
Outdated
| for i in range(a.ndim) | ||
| ]] | ||
| k_slice = slice(-k, None) if k > 0 else slice(-k) | ||
| return takeslice(a, k_slice, axis=axis) |
There was a problem hiding this comment.
I have a slight preference to keep this as it was if it's not critical. I find that dereferencing small functions like this makes it difficult for casual readers to understand a codebase. Obviously there is a point where a bit of code is frequently enough used or needs to be tested in isolation that pulling it off into a function makes sense, but it's not clear to me that this has yet gotten to that point.
There was a problem hiding this comment.
I personally find the original incredibly hard to read. Ultimately there should be a numpy PR and a backport to numpy_compat.py. I changed it as you requested for now.
| Post-processes the output of topk, sorting the results internally. | ||
| def topk_aggregate(a, k, axis, keepdims): | ||
| """Final aggregation kernel of topk. | ||
| Invoke topk one final time and then sort the results internally. |
There was a problem hiding this comment.
Nitpick, include a space or endline after the first """ and separate the header line by an empty line. See http://numpydoc.readthedocs.io/en/latest/ or http://dask.pydata.org/en/latest/develop.html#docstrings
| def reduction(x, chunk, aggregate, axis=None, keepdims=None, dtype=None, | ||
| split_every=None, combine=None, name=None, out=None): | ||
| split_every=None, combine=None, name=None, out=None, | ||
| concatenate=True, output_size=1): |
There was a problem hiding this comment.
Adding two new keyword arguments to this function seems important. Can I ask you to explain why they were necessary, perhaps in the top comment of this PR?
Also, I wanted to ask you to also add them to the docstring, but I see that the current docstring is unfortunately quite sprase. Git blame shows this to be my fault :/
There was a problem hiding this comment.
I'm writing the whole docstring...
I'd also like to change the function signature to reduction(x, chunk, combine, aggregate, ...) as I feel the combine function is fundamental and should not be relegated to a kwarg. Also it makes it easier for the user to wrap his head around it if the order of the arguments matches the order of execution.
Do you agree to the change? I would make it in a separate PR to keep things clean.
| a low value can reduce cache size and network transfers, at the cost of | ||
| more CPU and a larger dask graph. | ||
|
|
||
| Omit to let dask heuristically decide a good default. A default can |
There was a problem hiding this comment.
A very poor heuristic by the way - this will be the object of another PR.
| split_every = dict.fromkeys(axis, n) | ||
| else: | ||
| split_every = dict((k, v) for (k, v) in enumerate(x.numblocks) if k in axis) | ||
| raise ValueError("split_every must be a int or a dict") |
There was a problem hiding this comment.
Removed undocumented feature where split_every="your mom" meant a reduction in exactly 2 passes. We could formally reintroduce it in the form of split_every=-1 as part of a separate PR.
There was a problem hiding this comment.
Also, split_every=None defaults to config.get('split_every', 4), while split_every=dict() defaults to 2 for the missing axes. Does not seem very coherent to me - but changing it would be out of scope of this PR.
|
@mrocklin overhauled all docstrings. Added extensive comments in both topk and argtopk that explain how the algorithm is implemented. |
|
Travis failure is unrelated and caused by #3578 |
|
Could you please try merging with or rebasing on |
|
Branch seems to take care of my issue in #3596. LGTM @crusaderky |
mrocklin
left a comment
There was a problem hiding this comment.
In general this looks good to me. Thank you for adding the comprehensive docstrings @crusaderky . A few small comments below.
I also appreciate your patience with review on this. We seem to be low on reviewers these days.
| assert_eq(npfunc(c, axis=0)[-1:][::-1], | ||
| daskfunc(c, 1, split_every=split_every)) | ||
| assert_eq(npfunc(c, axis=0)[:1], | ||
| daskfunc(c, -1, split_every=split_every)) |
There was a problem hiding this comment.
Why were these tests removed? Were they no longer important for some reason?
There was a problem hiding this comment.
The previous implementation stuffed the data and the index into a recarray. The test was necessary to verify that you could transparently build a nested recarray. Now that I'm using tuples of arrays, there's no reason to think recarrays will behave any different from anything else.
docs/source/changelog.rst
Outdated
| +++++ | ||
|
|
||
| - | ||
| - Reimplemented ``argtopk`` to make it release the GIL (:pr:`3596`) `Guido Imperiale` |
There was a problem hiding this comment.
You'll want to terminate your name here with an underscore like
`Guido Imperiale`_
dask/array/reductions.py
Outdated
| dask array | ||
|
|
||
| Kernel Parameters | ||
| ----------------- |
There was a problem hiding this comment.
My experience has been that sphinx tends to drop section headers that don't match one of their known set. I tend to revert to using boldface instead like **Kernel Parameters**. It would be good to check that things haven't changed though.
dask/array/reductions.py
Outdated
| x: Array | ||
| Data being reduced along one or more axes | ||
| chunk: callable(x_chunk, axis, keepdims) | ||
| First kernel function to be executed when resolving the dask graph. |
There was a problem hiding this comment.
The term kernel might not be well understood by users. I wonder if we can use the term function instead throughout this docstring. I suspect that this will be more familiar to novice users.
dask/array/reductions.py
Outdated
| combine steps do not produce np.arrays. | ||
| output_size: int >= 1, optional | ||
| Size of the output of the ``aggregate`` kernel along the reduced axes. | ||
| Ignored if keepdims is False. |
There was a problem hiding this comment.
These seem like good changes to me. Thank you for adding their explanation.
|
@mrocklin on holiday for a week - I'll incorporate your suggestions when I'm back |
|
Enjoy the time off!
…On Thu, Jun 21, 2018 at 2:06 PM, crusaderky ***@***.***> wrote:
@mrocklin <https://github.com/mrocklin> on holiday for a week - I'll
incorporate your suggestions when I'm back
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3610 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AASszEVi8eXWno0CCdiQ1fDgdles_VyOks5t--CYgaJpZM4Un281>
.
|
|
incorporated all of @mrocklin's suggestions; ready to merge |
|
Agreed. Merging. Thank you for this @crusaderky ! |
….com/convexset/dask into fix-tsqr-case-chunk-with-zero-height * 'fix-tsqr-case-chunk-with-zero-height' of https://github.com/convexset/dask: fixed typo in documentation and improved clarity Implement .blocks accessor (dask#3689) Fix wrong names (dask#3695) Adds endpoint and retstep support for linspace (dask#3675) Add the @ operator to the delayed objects (dask#3691) Align auto chunks to provided chunks, rather than shape (dask#3679) Adds quotes to source pip install (dask#3678) Prefer end-tasks with low numbers of dependencies when ordering (dask#3588) Reimplement argtopk to release the GIL (dask#3610) Note `da.pad` can be used with `map_overlap` (dask#3672) Allow tasks back onto ordering stack if they have one dependency (dask#3652) Fix extra progressbar (dask#3669) Break apart uneven array-of-int slicing to separate chunks (dask#3648) fix for `dask.array.linalg.tsqr` fails tests (intermittently) with arrays of uncertain dimensions (dask#3662)
Closes #3596