ENH: Adds sort_values to dask.DataFrame#7286
Conversation
|
This seems like a reasonable solution to me and I feel like we are open to an implementation of
I'd prefer that this much duplication be abstracted away a bit.
We generally use pre-commit to run black. You can read more about how that works here: https://docs.dask.org/en/latest/develop.html#code-formatting |
|
Thanks for the feedback - I'll get this cleaned up. |
|
@jsignell I think this is good to go, let me know if there are changes I should make. |
jsignell
left a comment
There was a problem hiding this comment.
There are some lingering issues to be resolved, but this PR is coming along nicely!
dask/dataframe/core.py
Outdated
| divisions: list, optional | ||
| Known values on which to separate index values of the partitions. | ||
| See https://docs.dask.org/en/latest/dataframe-design.html#partitions | ||
| Defaults to computing this with a single pass over the data. Note | ||
| that if ``sorted=True``, specified divisions are assumed to match | ||
| the existing partitions in the data. If ``sorted=False``, you should | ||
| leave divisions empty and call ``repartition`` after ``set_index``. |
There was a problem hiding this comment.
This docstring looks copied from set_index and I'm not sure how much of it applies to sort_values. In particular, it is probably safe to assume that the column is not sorted, so I don't think divisions should even be an option in this method.
There was a problem hiding this comment.
Yeah - you're right here. I was copying the interface from set_index and this doesn't make sense, removing.
dask/dataframe/shuffle.py
Outdated
| divisions = mins + [maxes[-1]] | ||
| return df.map_partitions(M.sort_values, value) | ||
| df = rearrange_by_divisions(df, value, divisions) | ||
| df.divisions = divisions |
There was a problem hiding this comment.
I don't think you want to set df.divisions to divisions. The output of _calculate_division is the divisions that would be on the sort_by_col if it were the index, but in the case of sort_by_values the column does not become the index, so the divisions in this method are not equivalent to the divisions on the resultant dataframe.
dask/dataframe/tests/test_shuffle.py
Outdated
| tm.assert_frame_equal( | ||
| ddf.sort_values("a").compute().reset_index(drop=True), | ||
| df.sort_values("a").reset_index(drop=True), | ||
| ) |
There was a problem hiding this comment.
It's preferable to use assert_eq like the rest of the tests in this file do. That helper function checks properties of the dask dataframe before compute is called. By changing this test to use assert_eq you'll see the divisions issue that I mentioned above.
There was a problem hiding this comment.
Ah! I see - thanks!
|
This should be in a better state now - thanks for catching those oversights. |
|
cc @rjzamora (in case this if of interest 🙂) |
Seems like the logical way to handle single-column Dask-CuDF actually does the same exact thing to support both single- and multi-column |
Ah that would be a nice extension. |
|
Thanks for the PR @gerrymanoim! |
| and npartitions == df.npartitions | ||
| ): | ||
| # divisions are in the right place | ||
| divisions = mins + [maxes[-1]] |
There was a problem hiding this comment.
just noticing while rebasing #7214: I think this line is dead code.
Not sure if this value was supposed to go anywhere, from skimming the review comments, it seems like it was originally passed to something but is no longer (which is correct / as it should be), so this line can just be removed.
| partition_size=128e6, | ||
| **kwargs, | ||
| ): | ||
| """ See _Frame.sort_values for docstring """ |
There was a problem hiding this comment.
I think it ended up as DataFrame.sort_values, though moving to _Frame does sound like it should "just work" and would be beneficial
There was a problem hiding this comment.
actually Series.sort_values() needs a different API than DataFrame.sort_values (former does not take a by column name), so someone can add it in another PR; I just updated this comment in #7462
black dask/flake8 dask. Running black oncore.pyandshuffle.pycaused large diffs - maybe I have something misconfigured?As suggested #958 (comment) and #2367, I've pretty much reused the code in
set_indexand don't mess with the underlying bits. Shared code betweenset_indexandsort_valuesis in_calculate_divisions.Implementation caveats:
ascending=Trueis supportedCloses #958