Allow custom sort functions for dask-cudf sort_values #9789

rapids-bot[bot] merged 7 commits into rapidsai:branch-22.02
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9789 +/- ##
================================================
- Coverage 10.49% 10.44% -0.05%
================================================
Files 119 119
Lines 20305 20476 +171
================================================
+ Hits 2130 2139 +9
- Misses 18175 18337 +162
Continue to review full report at Codecov.
rerun tests

rerun tests

This PR has been labeled
rjzamora left a comment
The general change here makes sense to me, thanks for working on this @charlesbluca !
My main comment/suggestion is to avoid "breaking" API changes by moving the default-handling logic into sorting.sort_values.
- df4 = df3.map_partitions(
-     M.sort_values, by, ascending=ascending, na_position=na_position
- )
+ df4 = df3.map_partitions(sort_function, **sort_function_kwargs)
Something feels off here. We are requiring that the user specify sort_function, but the API makes it seem optional. I worry that we are now silently ignoring ascending and na_position (and maybe even by?).
What if downstream users are implementing code with sorting.sort_values directly? I don't think that is good/recommended practice, but the API we are changing seems "public" to me (making this a breaking change).
Perhaps a simpler (non-breaking) solution would be to remove most of the changes from DataFrame.sort_values, pass through sort_function and sort_function_kwargs into here, and implement the sort_function/sort_function_kwargs default logic here (in sorting.sort_values). Does this seem reasonable?
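To make the suggestion concrete, here is a minimal, pandas-only sketch of what that default-handling logic inside `sorting.sort_values` might look like; the function signature and kwarg names mirror this PR, but the body is illustrative, not the actual dask-cudf code (a single frame stands in for one partition, and the real implementation applies the function via `map_partitions`):

```python
import pandas as pd

def sort_values(df, by, ascending=True, na_position="last",
                sort_function=None, sort_function_kwargs=None):
    # Default to the partition's own sort_values, forwarding the standard
    # kwargs so ascending/na_position are not silently ignored when no
    # custom function is supplied.
    sort_kwargs = {"by": by, "ascending": ascending, "na_position": na_position}
    if sort_function is None:
        sort_function = lambda part, **kw: part.sort_values(**kw)
    if sort_function_kwargs is None:
        sort_function_kwargs = sort_kwargs
    # Real code would do: df.map_partitions(sort_function, **sort_function_kwargs)
    return sort_function(df, **sort_function_kwargs)

out = sort_values(pd.DataFrame({"a": [3, 1, 2]}), "a")
```

With this shape, existing callers of `sorting.sort_values` keep working unchanged, and only callers that pass `sort_function` opt into the new behavior.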
That makes sense and is a valid concern - my only comment is that we ideally still want to allow for custom sorting functions in the npartitions == 1 case that is handled directly in DataFrame.sort_values, so I think it might also make sense to move the following logic:

    if self.npartitions == 1:
        df = self.map_partitions(sort_function, **sort_kwargs)

into sorting.sort_values as well, unless there's a reason (not immediately obvious to me) why we would want to keep the single-partition case separate?
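A rough sketch of that consolidation, with the partition collection simulated by a plain list of pandas frames (hypothetical code, not the dask-cudf source):

```python
import pandas as pd

def sort_values(partitions, by, sort_function=None, sort_function_kwargs=None):
    # Same defaulting as before, so the custom function remains optional.
    if sort_function is None:
        sort_function = lambda part, **kw: part.sort_values(**kw)
    if sort_function_kwargs is None:
        sort_function_kwargs = {"by": by}
    # Single-partition shortcut, now inside sorting.sort_values: just sort
    # the one partition, no shuffle needed, and the custom function still
    # applies.
    if len(partitions) == 1:
        return [sort_function(partitions[0], **sort_function_kwargs)]
    # The multi-partition path would shuffle on `by` first, then sort each
    # resulting partition (shuffle omitted in this sketch).
    return [sort_function(p, **sort_function_kwargs) for p in partitions]

single = sort_values([pd.DataFrame({"a": [2, 1]})], "a")
```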
Also noting that this is also a concern for the upstream implementation of this, so depending on what we decide on here I will open up a follow up PR to address this in Dask.
> Also noting that this is also a concern for the upstream implementation of this, so depending on what we decide on here I will open up a follow up PR to address this in Dask.
Good point! I definitely like the simplification you made here. So it probably makes sense to do something similar upstream.
rjzamora left a comment
Thanks for revising this @charlesbluca! Everything looks great to me.
@gpucibot merge
This PR moves the handling of custom sorting functions to `shuffle.sort_values`, so that usages of the internal `sort_values` function will not have to manually specify a default `sort_function` and `sort_function_kwargs`. This originated as a concern in the downstream implementation of this in rapidsai/cudf#9789
Similar to dask/dask#8345, this PR allows the sorting function called on each partition in the last step of dask-cudf's sort_values to be generalized, along with the kwargs that are supplied to it. This allows sort_values to be extended to support more complex ascending / null-position handling.

The context for this PR is a desire to simplify the sorting algorithm used by dask-sql; since it only really differs from dask-cudf's sorting algorithm in that it uses a custom sorting function, it seems easier to allow for that extension upstream than to duplicate code in dask-sql.
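As an illustration of the kind of extension dask-sql needs, here is a pandas-only sketch of a custom per-partition sort function; `sort_partition` and its `null_first` flag are hypothetical names invented for this example, not part of dask-cudf or dask-sql:

```python
import pandas as pd

def sort_partition(df, by, ascending=True, null_first=False):
    # Custom per-partition sort with explicit null placement - the sort of
    # behavior that goes beyond plain ascending/na_position defaults.
    return df.sort_values(
        by,
        ascending=ascending,
        na_position="first" if null_first else "last",
    )

df = pd.DataFrame({"a": [3.0, None, 1.0]})
# With the generalized API this would be supplied as
# sort_function=sort_partition,
# sort_function_kwargs={"by": "a", "null_first": True}
out = sort_partition(df, "a", null_first=True)
```

Because the custom function receives each partition plus its kwargs, dask-sql can swap in its own ordering semantics without duplicating the rest of the sorting algorithm.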