Skip to content

Reimplement argtopk to release the GIL#3610

Merged
mrocklin merged 19 commits intodask:masterfrom
crusaderky:argtopk_nogil
Jun 27, 2018
Merged

Reimplement argtopk to release the GIL#3610
mrocklin merged 19 commits intodask:masterfrom
crusaderky:argtopk_nogil

Conversation

@crusaderky
Copy link
Collaborator

Closes #3596

@@ -1,248 +1,295 @@
""" A set of NumPy functions to apply per chunk """
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a github bug getting confused by DOS/Unix line terminations? The commit looks right both in PyCharm and in Gitkraken...

@crusaderky
Copy link
Collaborator Author

@piercefreeman @mrocklin ready for review and merge

@crusaderky crusaderky closed this Jun 14, 2018
@crusaderky crusaderky reopened this Jun 14, 2018
Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, mostly stylistic.

@crusaderky you may be over-estimating the intelligence of reviewers. If you have time to say what you did and how you solved the problem that would help. It takes me a surprisingly long time to get into how someone else solves a problem only by looking at their code.

for i in range(a.ndim)
]]
k_slice = slice(-k, None) if k > 0 else slice(-k)
return takeslice(a, k_slice, axis=axis)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slight preference to keep this as it was if it's not critical. I find that dereferencing small functions like this makes it difficult for casual readers to understand a codebase. Obviously there is a point where a bit of code is frequently enough used or needs to be tested in isolation that pulling it off into a function makes sense, but it's not clear to me that this has yet gotten to that point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find the original incredibly hard to read. Ultimately there should be a numpy PR and a backport to numpy_compat.py. I changed it as you requested for now.

Post-processes the output of topk, sorting the results internally.
def topk_aggregate(a, k, axis, keepdims):
"""Final aggregation kernel of topk.
Invoke topk one final time and then sort the results internally.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, include a space or endline after the first """ and separate the header line by an empty line. See http://numpydoc.readthedocs.io/en/latest/ or http://dask.pydata.org/en/latest/develop.html#docstrings

def reduction(x, chunk, aggregate, axis=None, keepdims=None, dtype=None,
split_every=None, combine=None, name=None, out=None):
split_every=None, combine=None, name=None, out=None,
concatenate=True, output_size=1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding two new keyword arguments to this function seems important. Can I ask you to explain why they were necessary, perhaps in the top comment of this PR?

Also, I wanted to ask you to also add them to the docstring, but I see that the current docstring is unfortunately quite sprase. Git blame shows this to be my fault :/

Copy link
Collaborator Author

@crusaderky crusaderky Jun 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm writing the whole docstring...

I'd also like to change the function signature to reduction(x, chunk, combine, aggregate, ...) as I feel the combine function is fundamental and should not be relegated to a kwarg. Also it makes it easier for the user to wrap his head around it if the order of the arguments matches the order of execution.
Do you agree to the change? I would make it in a separate PR to keep things clean.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edited typos

a low value can reduce cache size and network transfers, at the cost of
more CPU and a larger dask graph.

Omit to let dask heuristically decide a good default. A default can
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very poor heuristic by the way - this will be the object of another PR.

split_every = dict.fromkeys(axis, n)
else:
split_every = dict((k, v) for (k, v) in enumerate(x.numblocks) if k in axis)
raise ValueError("split_every must be a int or a dict")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed undocumented feature where split_every="your mom" meant a reduction in exactly 2 passes. We could formally reintroduce it in the form of split_every=-1 as part of a separate PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, split_every=None defaults to config.get('split_every', 4), while split_every=dict() defaults to 2 for the missing axes. Does not seem very coherent to me - but changing it would be out of scope of this PR.

@crusaderky
Copy link
Collaborator Author

@mrocklin overhauled all docstrings. Added extensive comments in both topk and argtopk that explain how the algorithm is implemented.

@crusaderky
Copy link
Collaborator Author

Travis failure is unrelated and caused by #3578

@jakirkham
Copy link
Member

Could you please try merging with or rebasing on master? Expect that will fix this the issue.

@piercefreeman
Copy link

Branch seems to take care of my issue in #3596. LGTM @crusaderky

Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks good to me. Thank you for adding the comprehensive docstrings @crusaderky . A few small comments below.

I also appreciate your patience with review on this. We seem to be low on reviewers these days.

assert_eq(npfunc(c, axis=0)[-1:][::-1],
daskfunc(c, 1, split_every=split_every))
assert_eq(npfunc(c, axis=0)[:1],
daskfunc(c, -1, split_every=split_every))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these tests removed? Were they no longer important for some reason?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous implementation stuffed the data and the index into a recarray. The test was necessary to verify that you could transparently build a nested recarray. Now that I'm using tuples of arrays, there's no reason to think recarrays will behave any different from anything else.

+++++

-
- Reimplemented ``argtopk`` to make it release the GIL (:pr:`3596`) `Guido Imperiale`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll want to terminate your name here with an underscore like

`Guido Imperiale`_

dask array

Kernel Parameters
-----------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My experience has been that sphinx tends to drop section headers that don't match one of their known set. I tend to revert to using boldface instead like **Kernel Parameters**. It would be good to check that things haven't changed though.

x: Array
Data being reduced along one or more axes
chunk: callable(x_chunk, axis, keepdims)
First kernel function to be executed when resolving the dask graph.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term kernel might not be well understood by users. I wonder if we can use the term function instead throughout this docstring. I suspect that this will be more familiar to novice users.

combine steps do not produce np.arrays.
output_size: int >= 1, optional
Size of the output of the ``aggregate`` kernel along the reduced axes.
Ignored if keepdims is False.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem like good changes to me. Thank you for adding their explanation.

@crusaderky
Copy link
Collaborator Author

@mrocklin on holiday for a week - I'll incorporate your suggestions when I'm back

@mrocklin
Copy link
Member

mrocklin commented Jun 21, 2018 via email

@crusaderky
Copy link
Collaborator Author

incorporated all of @mrocklin's suggestions; ready to merge

@mrocklin
Copy link
Member

Agreed. Merging. Thank you for this @crusaderky !

@mrocklin mrocklin merged commit 5fe6e10 into dask:master Jun 27, 2018
@crusaderky crusaderky deleted the argtopk_nogil branch June 27, 2018 19:01
convexset added a commit to convexset/dask that referenced this pull request Jul 1, 2018
….com/convexset/dask into fix-tsqr-case-chunk-with-zero-height

* 'fix-tsqr-case-chunk-with-zero-height' of https://github.com/convexset/dask:
  fixed typo in documentation and improved clarity
  Implement .blocks accessor (dask#3689)
  Fix wrong names (dask#3695)
  Adds endpoint and retstep support for linspace (dask#3675)
  Add the @ operator to the delayed objects (dask#3691)
  Align auto chunks to provided chunks, rather than shape (dask#3679)
  Adds quotes to source pip install (dask#3678)
  Prefer end-tasks with low numbers of dependencies when ordering (dask#3588)
  Reimplement argtopk to release the GIL (dask#3610)
  Note `da.pad` can be used with `map_overlap` (dask#3672)
  Allow tasks back onto ordering stack if they have one dependency (dask#3652)
  Fix extra progressbar (dask#3669)
  Break apart uneven array-of-int slicing to separate chunks (dask#3648)
  fix for `dask.array.linalg.tsqr` fails tests (intermittently) with arrays of uncertain dimensions (dask#3662)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants