Allow custom sort functions for dask-cudf sort_values #9789

rapids-bot[bot] merged 7 commits into rapidsai:branch-22.02
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9789 +/- ##
================================================
- Coverage 10.49% 10.44% -0.05%
================================================
Files 119 119
Lines 20305 20476 +171
================================================
+ Hits 2130 2139 +9
- Misses 18175 18337 +162
Continue to review full report at Codecov.
rerun tests

rerun tests

This PR has been labeled
rjzamora left a comment
The general change here makes sense to me, thanks for working on this @charlesbluca !
My main comment/suggestion is to avoid "breaking" API changes by moving the default-handling logic into sorting.sort_values.
- df4 = df3.map_partitions(
-     M.sort_values, by, ascending=ascending, na_position=na_position
- )
+ df4 = df3.map_partitions(sort_function, **sort_function_kwargs)
Something feels off here. We are requiring that the user specify sort_function, but the API makes it seem optional. I worry that we are now silently ignoring ascending and na_position (and maybe even by?).
What if downstream users are implementing code with sorting.sort_values directly? I don't think that is good/recommended practice, but the API we are changing seems "public" to me (making this a breaking change).
Perhaps a simpler (non-breaking) solution would be to remove most of the changes from DataFrame.sort_values, pass through sort_function and sort_function_kwargs into here, and implement the sort_function/sort_function_kwargs default logic here (in sorting.sort_values). Does this seem reasonable?
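To make the suggestion concrete, here is a minimal, pandas-only sketch of what that default-handling logic inside `sorting.sort_values` might look like; the function signature and kwarg names mirror this PR, but the body is illustrative, not the actual dask-cudf code (a single frame stands in for one partition, and the real implementation applies the function via `map_partitions`):

```python
import pandas as pd

def sort_values(df, by, ascending=True, na_position="last",
                sort_function=None, sort_function_kwargs=None):
    # Default to the partition's own sort_values, forwarding the standard
    # kwargs so ascending/na_position are not silently ignored when no
    # custom function is supplied.
    sort_kwargs = {"by": by, "ascending": ascending, "na_position": na_position}
    if sort_function is None:
        sort_function = lambda part, **kw: part.sort_values(**kw)
    if sort_function_kwargs is None:
        sort_function_kwargs = sort_kwargs
    # Real code would do: df.map_partitions(sort_function, **sort_function_kwargs)
    return sort_function(df, **sort_function_kwargs)

out = sort_values(pd.DataFrame({"a": [3, 1, 2]}), "a")
```

With this shape, existing callers of `sorting.sort_values` keep working unchanged, and only callers that pass `sort_function` opt into the new behavior.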
That makes sense and is a valid concern - my only comment is that we ideally still want to allow for custom sorting functions in the npartitions == 1 case that is handled directly in DataFrame.sort_values, so I think it might also make sense to move the following logic:

    if self.npartitions == 1:
        df = self.map_partitions(sort_function, **sort_kwargs)

into sorting.sort_values as well, unless there's a reason (not immediately obvious to me) why we would want to keep the single-partition case separate?
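A rough sketch of that consolidation, with the partition collection simulated by a plain list of pandas frames (hypothetical code, not the dask-cudf source):

```python
import pandas as pd

def sort_values(partitions, by, sort_function=None, sort_function_kwargs=None):
    # Same defaulting as before, so the custom function remains optional.
    if sort_function is None:
        sort_function = lambda part, **kw: part.sort_values(**kw)
    if sort_function_kwargs is None:
        sort_function_kwargs = {"by": by}
    # Single-partition shortcut, now inside sorting.sort_values: just sort
    # the one partition, no shuffle needed, and the custom function still
    # applies.
    if len(partitions) == 1:
        return [sort_function(partitions[0], **sort_function_kwargs)]
    # The multi-partition path would shuffle on `by` first, then sort each
    # resulting partition (shuffle omitted in this sketch).
    return [sort_function(p, **sort_function_kwargs) for p in partitions]

single = sort_values([pd.DataFrame({"a": [2, 1]})], "a")
```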
Also noting that this is also a concern for the upstream implementation of this, so depending on what we decide on here I will open up a follow up PR to address this in Dask.
> Also noting that this is also a concern for the upstream implementation of this, so depending on what we decide on here I will open up a follow up PR to address this in Dask.
Good point! I definitely like the simplification you made here. So it probably makes sense to do something similar upstream.
rjzamora left a comment
Thanks for revising this @charlesbluca! Everything looks great to me.
@gpucibot merge
This PR moves the handling of custom sorting functions to `shuffle.sort_values`, so that usages of the internal `sort_values` function will not have to manually specify a default `sort_function` and `sort_function_kwargs`. This originated as a concern in the downstream implementation of this in rapidsai/cudf#9789
Similar to dask/dask#8345, this PR allows the sorting function called on each partition in the last step of dask-cudf's sort_values to be generalized, along with the kwargs that are supplied to it. This allows sort_values to be extended to support more complex ascending / null-position handling.

The context for this PR is a desire to simplify the sorting algorithm used by dask-sql; since it only really differs from dask-cudf's sorting algorithm in that it uses a custom sorting function, it seems easier to allow for that extension upstream than to duplicate code in dask-sql.
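As an illustration of the kind of extension dask-sql needs, here is a pandas-only sketch of a custom per-partition sort function; `sort_partition` and its `null_first` flag are hypothetical names invented for this example, not part of dask-cudf or dask-sql:

```python
import pandas as pd

def sort_partition(df, by, ascending=True, null_first=False):
    # Custom per-partition sort with explicit null placement - the sort of
    # behavior that goes beyond plain ascending/na_position defaults.
    return df.sort_values(
        by,
        ascending=ascending,
        na_position="first" if null_first else "last",
    )

df = pd.DataFrame({"a": [3.0, None, 1.0]})
# With the generalized API this would be supplied as
# sort_function=sort_partition,
# sort_function_kwargs={"by": "a", "null_first": True}
out = sort_partition(df, "a", null_first=True)
```

Because the custom function receives each partition plus its kwargs, dask-sql can swap in its own ordering semantics without duplicating the rest of the sorting algorithm.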