Move custom sort function logic to internal sort_values#8571
```python
if self.npartitions == 1:
    return self.map_partitions(sort_function, **sort_kwargs)
```
Noting that we would also need to move the single partition sort case to the internal sort_values so that custom sorting functions are supported there; this has the side effect of making it so multi-column sorting is no longer supported for single partition cases, which is niche enough that I don't think this would cause any downstream breakage
Could you explain this a bit more, maybe add a snippet or a test demonstrating the behavior change? From reading your comment, it seems that now the test you included in #8345 would fail, but it seems to pass?
Sure! So before, if we were to try something like

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"x": [1, 2, 3, 1, 2, 3], "y": [1, 2, 3, 4, 5, 6]})
ddf = dd.from_pandas(df, npartitions=1)
ddf.sort_values(by=["x", "y"]).compute()
```

Because the dataframe has a single partition, we would reach the map_partitions call I removed above before reaching Dask's input validation checks in shuffle.sort_values, one of which checks that by is only a single column:
dask/dask/dataframe/shuffle.py (lines 88 to 97 in 3e7d1d0)
Now that the map_partitions call is moved to be after these input validation checks, the above snippet would trigger the NotImplementedError, even though it is possible for the single-partition dataframe to be sorted by multiple columns.
As I'm writing this out, I realize that it should be possible to modify this check of by somewhat to allow multiple columns when the dataframe has a single partition - I will try making those changes now 🙂
> it seems that now the test you included in #8345 would fail, but it seems to pass?
That test achieves multi-column sorting in a different, somewhat hacky way: it passes a single by-column to Dask's sort_values and multiple by-columns to the custom sorting function (which is essentially just a wrapper around the partition library's sort_values). Dask performs a rough initial "sort" using the single column it has been given, then hands off the remaining work of the multi-column sort to the partition library (pandas, cuDF, etc.), which actually supports multi-column sorting.
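The two-stage idea above can be sketched in plain pandas. This is an illustrative stand-in, not dask's actual shuffle code: the "rough sort" and the chunk boundaries here are hypothetical simplifications of what dask's quantile-based partitioning does.

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 1, 2, 1, 3, 2], "y": [6, 4, 5, 1, 3, 2]})

# Stand-in for dask's rough single-column shuffle: order by "x" and cut the
# frame into partitions so rows with equal "x" land in the same partition
# (dask's quantile-based boundaries generally achieve the same thing).
rough = df.sort_values(by="x")
parts = [rough[rough["x"] <= 1], rough[rough["x"] > 1]]

# The custom sort_function (here just pandas' own sort_values) then finishes
# the multi-column sort within each partition.
result = pd.concat(p.sort_values(by=["x", "y"]) for p in parts)

print(result.equals(df.sort_values(by=["x", "y"])))  # True
```

Because the rough pass already grouped equal "x" values into the same partition, concatenating the per-partition multi-column sorts reproduces the full sort.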
Thanks for the detailed explanation! That makes sense. I'm wondering if it would be good to add something like your snippet above (`ddf.sort_values(by=["x", "y"]).compute()`) as a test?
Yeah that's a good idea - I'll add a single partition sorting test
I think #8345 and this conversation are helpful context.

@ian-r-rose Can you take another look here when you get a chance?
ian-r-rose left a comment
Sorry for the slow review @charlesbluca!
dask/dataframe/tests/test_shuffle.py (outdated)

```python
@pytest.mark.parametrize("by", ["a", "b"])
@pytest.mark.parametrize("nelem", [10, 500])
@pytest.mark.parametrize("nparts", [1, 10])
def test_sort_values(nelem, nparts, by, ascending):
```
Looks like a legitimate set of test failures -- nparts is no longer a parameterized value here, and should be removed.
```python
if not isinstance(by, list):
    by = [by]
if len(by) > 1 and df.npartitions > 1:
    raise NotImplementedError(
```
I think it would still be helpful to do a str check here. Naively, I tried passing in a dd.Series for by (pandas accepts this, and different parts of the dask API accept this as well, such as set_index). On main this leads to the helpful NotImplementedError, but with this change the error is much more opaque because it gets further than this validation step.
Now, it should also be possible to write this function so it takes a dd.Series, but that's probably outside the scope of this PR.
Ended up consolidating this all into the same check, with an error message that provides a general overview of the expected input and when multi-column sorting is available
```diff
- ddf = dd.from_pandas(df, npartitions=nparts)
+ ddf = dd.from_pandas(df, npartitions=10)
```
```python
with dask.config.set(scheduler="single-threaded"):
```
Is this configuration important? I'd expect it to work regardless of the scheduler I use, and it doesn't actually seem to take effect, since we leave the config block before ever computing.
This is here for debugging purposes, as sorting helper functions like set_partitions_pre run in parallel on a multi-threaded scheduler and it's not possible to debug each individual run of the function unless this config option is set.
You're correct in expecting this to work without the single-threaded scheduler, happy to remove if you think that's the best option here
I don't have a strong opinion on whether it should stay, though maybe a comment that it's only here for debugging purposes would be helpful
ian-r-rose left a comment
Thanks for your patience on this @charlesbluca! I'm happy with where this stands now
This PR moves the handling of custom sorting functions to shuffle.sort_values, so that usages of the internal sort_values function will not have to manually specify a default sort_function and sort_function_kwargs.

cc @rjzamora who raised this concern in the downstream implementation of this in rapidsai/cudf#9789
`pre-commit run --all-files`