Reimplement argtopk to release the GIL by crusaderky · Pull Request #3610 · dask/dask

crusaderky · 2018-06-14T12:23:23Z

crusaderky · 2018-06-14T12:28:08Z

dask/array/chunk.py

@@ -1,248 +1,295 @@
-""" A set of NumPy functions to apply per chunk """


Looks like a GitHub bug caused by confusion between DOS and Unix line endings? The commit looks right in both PyCharm and GitKraken...

crusaderky · 2018-06-14T16:23:40Z

@piercefreeman @mrocklin ready for review and merge

mrocklin

A few minor comments, mostly stylistic.

@crusaderky you may be over-estimating the intelligence of reviewers. If you have time to say what you did and how you solved the problem that would help. It takes me a surprisingly long time to get into how someone else solves a problem only by looking at their code.

mrocklin · 2018-06-15T02:29:29Z

dask/array/chunk.py

-        for i in range(a.ndim)
-    ]]
+    k_slice = slice(-k, None) if k > 0 else slice(-k)
+    return takeslice(a, k_slice, axis=axis)


I have a slight preference to keep this as it was if it's not critical. I find that dereferencing small functions like this makes it difficult for casual readers to understand a codebase. Obviously there is a point where a bit of code is used frequently enough, or needs to be tested in isolation, that pulling it off into a function makes sense, but it's not clear to me that this has gotten to that point yet.

I personally find the original incredibly hard to read. Ultimately there should be a numpy PR and a backport to numpy_compat.py. I changed it as you requested for now.
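For readers following the exchange above, here is a minimal NumPy sketch of what the `k_slice` expression in the diff selects. `take_slice_along_axis` is a hypothetical stand-in for the `takeslice` helper under discussion (its actual body is not shown in this thread), and pairing it with `np.partition` is an assumption made for illustration. The sketch also assumes `abs(k)` is smaller than the length of the axis.

```python
import numpy as np


def take_slice_along_axis(a, slc, axis):
    """Hypothetical helper: apply ``slc`` along ``axis``, keep all other axes whole."""
    index = [slice(None)] * a.ndim
    index[axis] = slc
    return a[tuple(index)]


def topk_chunk(a, k, axis=-1):
    """Return the k largest (k > 0) or -k smallest (k < 0) elements, unsorted."""
    # np.partition puts the element at position -k where it would be in a
    # sorted array; smaller elements end up before it, larger ones after it.
    a = np.partition(a, -k, axis=axis)
    # Same expression as in the diff: the tail for k > 0, the head for k < 0.
    k_slice = slice(-k, None) if k > 0 else slice(-k)
    return take_slice_along_axis(a, k_slice, axis)


data = np.array([[5, 1, 4, 2, 3],
                 [9, 7, 8, 6, 0]])
largest = topk_chunk(data, 2, axis=1)    # two largest per row, in no particular order
smallest = topk_chunk(data, -2, axis=1)  # two smallest per row, in no particular order
```

The list-of-slices indexing inside the helper is the part that the reviewer and the author disagree about; wrapping it in a named function is one way to keep the call site short.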

mrocklin · 2018-06-15T02:31:05Z

dask/array/chunk.py

-    Post-processes the output of topk, sorting the results internally.
+def topk_aggregate(a, k, axis, keepdims):
+    """Final aggregation kernel of topk.
+    Invoke topk one final time and then sort the results internally.


Nitpick, include a space or endline after the first """ and separate the header line by an empty line. See http://numpydoc.readthedocs.io/en/latest/ or http://dask.pydata.org/en/latest/develop.html#docstrings
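As an illustration of the layout requested here, the sketch below shows a docstring with a space after the opening quotes and an empty line after the summary. The function `sort_descending` is a hypothetical example and is not part of the PR.

```python
def sort_descending(values):
    """ Sort a sequence in descending order

    The summary line above starts after a space and is separated from this
    extended description by an empty line.

    Parameters
    ----------
    values : iterable
        Items to sort; they must be mutually comparable.

    Returns
    -------
    list
        A new list with the largest item first.
    """
    return sorted(values, reverse=True)


result = sort_descending([1, 3, 2])
```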

mrocklin · 2018-06-15T02:36:54Z

dask/array/reductions.py

 def reduction(x, chunk, aggregate, axis=None, keepdims=None, dtype=None,
-              split_every=None, combine=None, name=None, out=None):
+              split_every=None, combine=None, name=None, out=None,
+              concatenate=True, output_size=1):


Adding two new keyword arguments to this function seems important. Can I ask you to explain why they were necessary, perhaps in the top comment of this PR?

Also, I wanted to ask you to also add them to the docstring, but I see that the current docstring is unfortunately quite sparse. Git blame shows this to be my fault :/

I'm writing the whole docstring...

I'd also like to change the function signature to reduction(x, chunk, combine, aggregate, ...) as I feel the combine function is fundamental and should not be relegated to a kwarg. Also it makes it easier for the user to wrap his head around it if the order of the arguments matches the order of execution.
Do you agree to the change? I would make it in a separate PR to keep things clean.

edited typos
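To make the "order of execution" argument concrete, here is a list-based sketch of a tree reduction that calls `chunk`, then `combine`, then `aggregate`. `tree_reduce` and the mean helpers are hypothetical and written in plain Python. They are not dask's implementation, which builds a task graph rather than looping eagerly.

```python
def tree_reduce(blocks, chunk, combine, aggregate, split_every=2):
    """Hypothetical eager sketch of a chunk -> combine -> aggregate reduction."""
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    # Step 1: apply ``chunk`` to every input block independently.
    partials = [chunk(block) for block in blocks]
    # Step 2: merge groups of ``split_every`` partial results with ``combine``
    # until few enough remain for a single final call.
    while len(partials) > split_every:
        partials = [combine(partials[i:i + split_every])
                    for i in range(0, len(partials), split_every)]
    # Step 3: ``aggregate`` turns the last group of partials into the result.
    return aggregate(partials)


def mean_chunk(block):
    # Partial result per block: (sum, count).
    return sum(block), len(block)


def mean_combine(partials):
    # Merge several (sum, count) pairs into one.
    return sum(p[0] for p in partials), sum(p[1] for p in partials)


def mean_aggregate(partials):
    total, count = mean_combine(partials)
    return total / count


blocks = [[1, 2], [3, 4], [5, 6], [7, 8], [9]]
total = tree_reduce(blocks, sum, sum, sum)
mean = tree_reduce(blocks, mean_chunk, mean_combine, mean_aggregate, split_every=3)
```

In the mean example the intermediate results are tuples rather than numbers, which is why a separate `combine` step is needed in addition to `aggregate`.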

crusaderky · 2018-06-16T17:25:41Z

dask/array/reductions.py

+        a low value can reduce cache size and network transfers, at the cost of
+        more CPU and a larger dask graph.
+
+        Omit to let dask heuristically decide a good default. A default can


A very poor heuristic by the way - this will be the object of another PR.

crusaderky · 2018-06-16T17:28:11Z

dask/array/reductions.py

        split_every = dict.fromkeys(axis, n)
    else:
-        split_every = dict((k, v) for (k, v) in enumerate(x.numblocks) if k in axis)
+        raise ValueError("split_every must be a int or a dict")


Removed undocumented feature where split_every="your mom" meant a reduction in exactly 2 passes. We could formally reintroduce it in the form of split_every=-1 as part of a separate PR.

Also, split_every=None defaults to config.get('split_every', 4), while split_every=dict() defaults to 2 for the missing axes. Does not seem very coherent to me - but changing it would be out of scope of this PR.
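The behaviour described in this comment can be sketched as follows. `normalize_split_every` is a hypothetical helper, not dask's code: the `default` argument stands in for `config.get('split_every', 4)`, and because the diff does not show how `n` is derived from an integer `split_every`, the sketch simply reuses the integer as-is.

```python
from numbers import Integral


def normalize_split_every(split_every, axis, default=4):
    """Hypothetical sketch: turn ``split_every`` into a per-axis dict.

    ``axis`` is a tuple of the axes being reduced.
    """
    if split_every is None:
        # Stand-in for the configured default mentioned in the comment.
        split_every = default
    if isinstance(split_every, dict):
        # Axes missing from the dict fall back to 2, as noted in the comment.
        return {ax: split_every.get(ax, 2) for ax in axis}
    if isinstance(split_every, Integral):
        # Assumption: use the integer unchanged for every reduced axis.
        return dict.fromkeys(axis, split_every)
    # Anything else (for example an arbitrary string) is now rejected.
    raise ValueError("split_every must be an int or a dict")


from_none = normalize_split_every(None, (0, 1))
from_dict = normalize_split_every({0: 8}, (0, 1))
```

The two defaults (4 for `None`, 2 for axes missing from a dict) are the inconsistency that the comment points out.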

crusaderky · 2018-06-16T17:34:52Z

@mrocklin overhauled all docstrings. Added extensive comments in both topk and argtopk that explain how the algorithm is implemented.

crusaderky · 2018-06-16T18:16:12Z

Travis failure is unrelated and caused by #3578

jakirkham · 2018-06-16T18:29:38Z

Could you please try merging with or rebasing on master? I expect that will fix the issue.

piercefreeman · 2018-06-18T21:13:20Z

Branch seems to take care of my issue in #3596. LGTM @crusaderky

mrocklin

In general this looks good to me. Thank you for adding the comprehensive docstrings @crusaderky . A few small comments below.

I also appreciate your patience with review on this. We seem to be low on reviewers these days.

mrocklin · 2018-06-20T14:50:58Z

dask/array/tests/test_reductions.py

-    assert_eq(npfunc(c, axis=0)[-1:][::-1],
-              daskfunc(c, 1, split_every=split_every))
-    assert_eq(npfunc(c, axis=0)[:1],
-              daskfunc(c, -1, split_every=split_every))


Why were these tests removed? Were they no longer important for some reason?

The previous implementation stuffed the data and the index into a recarray. The test was necessary to verify that you could transparently build a nested recarray. Now that I'm using tuples of arrays, there's no reason to think recarrays will behave any differently from anything else.
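A rough NumPy sketch of the "tuples of arrays" idea: the values and their original indices travel together as a plain `(values, indices)` tuple instead of being packed into a record array. `argtopk_chunk` is a hypothetical function written for illustration and is not the code merged in this PR. It assumes `idx` has the same shape as `a`, that `abs(k)` is smaller than the axis length, and NumPy 1.15 or later for `np.take_along_axis`.

```python
import numpy as np


def argtopk_chunk(a, idx, k, axis=-1):
    """Hypothetical per-chunk step: top-k values plus their original indices."""
    # Positions, within this chunk, ordered so that the k largest sit at the
    # tail and the -k smallest sit at the head.
    pos = np.argpartition(a, -k, axis=axis)
    k_slice = slice(-k, None) if k > 0 else slice(-k)
    index = [slice(None)] * a.ndim
    index[axis] = k_slice
    pos = pos[tuple(index)]
    # Gather values and indices with the same positions so they stay aligned.
    values = np.take_along_axis(a, pos, axis=axis)
    indices = np.take_along_axis(idx, pos, axis=axis)
    return values, indices


a = np.array([10, 50, 30, 20, 40])
idx = np.arange(a.size) + 100  # pretend this chunk starts at global offset 100
values, indices = argtopk_chunk(a, idx, 2)
```

Because the result is an ordinary tuple of arrays, the dtype of the data does not matter to the bookkeeping, which is the point made in the reply above.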

mrocklin · 2018-06-20T14:51:30Z

docs/source/changelog.rst

 +++++

-
+- Reimplemented ``argtopk`` to make it release the GIL (:pr:`3596`) `Guido Imperiale`


You'll want to terminate your name here with an underscore like

`Guido Imperiale`_

mrocklin · 2018-06-20T14:53:21Z

dask/array/reductions.py

+    dask array
+
+    Kernel Parameters
+    -----------------


My experience has been that sphinx tends to drop section headers that don't match one of their known set. I tend to revert to using boldface instead like **Kernel Parameters**. It would be good to check that things haven't changed though.

mrocklin · 2018-06-20T14:54:30Z

dask/array/reductions.py

+    x: Array
+        Data being reduced along one or more axes
+    chunk: callable(x_chunk, axis, keepdims)
+        First kernel function to be executed when resolving the dask graph.


The term kernel might not be well understood by users. I wonder if we can use the term function instead throughout this docstring. I suspect that this will be more familiar to novice users.

mrocklin · 2018-06-20T14:54:53Z

dask/array/reductions.py

+        combine steps do not produce np.arrays.
+    output_size: int >= 1, optional
+        Size of the output of the ``aggregate`` kernel along the reduced axes.
+        Ignored if keepdims is False.


These seem like good changes to me. Thank you for adding their explanation.
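One way to read the `output_size` description is through the shape of the result. `reduced_shape` below is a hypothetical helper that only restates the docstring text quoted above. It is not part of dask, and the top-k interpretation in the usage lines is an assumption.

```python
def reduced_shape(shape, axis, keepdims=False, output_size=1):
    """Hypothetical helper: result shape of a reduction over ``axis``."""
    if keepdims:
        # Reduced axes are kept, with ``output_size`` elements each.
        return tuple(output_size if i in axis else n
                     for i, n in enumerate(shape))
    # Reduced axes are dropped, so ``output_size`` is ignored.
    return tuple(n for i, n in enumerate(shape) if i not in axis)


plain = reduced_shape((10, 20), axis=(1,), keepdims=True)
top3 = reduced_shape((10, 20), axis=(1,), keepdims=True, output_size=3)
dropped = reduced_shape((10, 20), axis=(1,), keepdims=False, output_size=3)
```

Under this reading, a reduction that keeps three elements along the reduced axis would pass `output_size=3` together with `keepdims=True`.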

crusaderky · 2018-06-21T18:06:15Z

@mrocklin on holiday for a week - I'll incorporate your suggestions when I'm back

mrocklin · 2018-06-21T18:12:20Z

Enjoy the time off!

crusaderky · 2018-06-27T17:52:24Z

incorporated all of @mrocklin's suggestions; ready to merge

mrocklin · 2018-06-27T18:02:08Z

Agreed. Merging. Thank you for this @crusaderky !

….com/convexset/dask into fix-tsqr-case-chunk-with-zero-height

* 'fix-tsqr-case-chunk-with-zero-height' of https://github.com/convexset/dask:
  fixed typo in documentation and improved clarity
  Implement .blocks accessor (dask#3689)
  Fix wrong names (dask#3695)
  Adds endpoint and retstep support for linspace (dask#3675)
  Add the @ operator to the delayed objects (dask#3691)
  Align auto chunks to provided chunks, rather than shape (dask#3679)
  Adds quotes to source pip install (dask#3678)
  Prefer end-tasks with low numbers of dependencies when ordering (dask#3588)
  Reimplement argtopk to release the GIL (dask#3610)
  Note `da.pad` can be used with `map_overlap` (dask#3672)
  Allow tasks back onto ordering stack if they have one dependency (dask#3652)
  Fix extra progressbar (dask#3669)
  Break apart uneven array-of-int slicing to separate chunks (dask#3648)
  fix for `dask.array.linalg.tsqr` fails tests (intermittently) with arrays of uncertain dimensions (dask#3662)

gimperiale and others added 2 commits June 9, 2018 22:08

Merge remote-tracking branch 'dask/master'

0f93daf

Reimplement argtopk to release the GIL

f9b0847

crusaderky commented Jun 14, 2018

crusaderky added 4 commits June 14, 2018 13:30

dos2unix

7fa8fae

Fix doctest

188e424

Fix doctest (2)

c79078e

Fix doctest (3)

722f34f

Documentation tweaks

bdadcca

crusaderky closed this Jun 14, 2018

crusaderky reopened this Jun 14, 2018

mrocklin reviewed Jun 15, 2018

crusaderky added 5 commits June 16, 2018 16:58

Remove takeslice

e7b9c23

.gitignore additions

09c7378

Documentation bonanza

d06c8d4

Remove support for split_every="your mom"

af9425f

Merge remote-tracking branch 'dask/master' into argtopk_nogil

0f37b82

crusaderky commented Jun 16, 2018

changelog

a7825b8

Merge remote-tracking branch 'dask/master' into argtopk_nogil

2a4db53

Merge branch 'master' into argtopk_nogil

434f4d0

mrocklin reviewed Jun 20, 2018

crusaderky added 2 commits June 27, 2018 17:37

Merge remote-tracking branch 'dask/master' into argtopk_nogil

ea96871

kernel -> function

79d6085

crusaderky added 2 commits June 27, 2018 17:58

Normalize changelog style

7cbb1c0

More docs tweaks

d014eee

mrocklin merged commit 5fe6e10 into dask:master Jun 27, 2018

crusaderky mentioned this pull request Jun 27, 2018

Add NumPy's new take_along_axis #3663

Open

crusaderky deleted the argtopk_nogil branch June 27, 2018 19:01
