Add extension dtype support to set_index#9566

Merged
jrbourbeau merged 14 commits into dask:main from jrbourbeau:set_index-extension-dtypes
Oct 18, 2022
Conversation

@jrbourbeau
Member

Still very much a WIP, but wanted to push up for early feedback

Closes #5720, closes #8011, closes #9118

@mrocklin
Member

Woo 🎉

@jrbourbeau jrbourbeau changed the title from "[WIP] Add extension dtype support to set_index" to "Add extension dtype support to set_index" on Oct 12, 2022
Member Author
@jrbourbeau jrbourbeau left a comment

Alright, this is probably ready for some eyes at this point. @ian-r-rose I'd be curious what your thoughts are whenever you get a few minutes

try:
vals = data.quantile(q=qs / 100, interpolation=interpolation)
except TypeError:
vals, _ = _percentile(array_safe(data, data.dtype), qs, interpolation)
Member Author

gpuCI isn't happy about this array_safe casting, but I've got a feeling this will be a straightforward fix

Member

What is the goal of array_safe here? My understanding is that data is a pd/cudf.DataFrame here. Is the problem that data.values sometimes fails?

Member Author

I don't think `.values` is quite what we want here: at least on the pandas side, when extension dtypes are involved, `.values` returns a pandas extension array instead of a NumPy ndarray.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

In [3]: df = df.astype({"a": "Int64"})

In [4]: df.b.values
Out[4]: array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [5]: df.a.values
Out[5]:
<IntegerArray>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Length: 10, dtype: Int64

I'd be curious to see what the equivalent cudf code looks like for the above example ^

You bring up a good point though. Depending on what cudf is doing, an alternative approach to the array_safe casting could be to teach _percentile how to handle extension arrays

diff --git a/dask/dataframe/backends.py b/dask/dataframe/backends.py
index f26b5603f..25c46d762 100644
--- a/dask/dataframe/backends.py
+++ b/dask/dataframe/backends.py
@@ -559,6 +559,11 @@ def percentile(a, q, interpolation="linear"):
     return _percentile(a, q, interpolation)


+@percentile_lookup.register(pd.api.extensions.ExtensionArray)
+def _percentile_pandas_extension_array(a, q, interpolation="linear"):
+    return _percentile(a.to_numpy(), q, interpolation)
+
+
 ######################################
 # cuDF: Pandas Dataframes on the GPU #
 ######################################
diff --git a/dask/dataframe/partitionquantiles.py b/dask/dataframe/partitionquantiles.py
index 128405a2d..47c4076d4 100644
--- a/dask/dataframe/partitionquantiles.py
+++ b/dask/dataframe/partitionquantiles.py
@@ -423,7 +423,7 @@ def percentiles_summary(df, num_old, num_new, upsample, state):
     try:
         vals = data.quantile(q=qs / 100, interpolation=interpolation).values
     except (TypeError, NotImplementedError):
-        vals, _ = _percentile(array_safe(data, data.dtype), qs, interpolation)
+        vals, _ = _percentile(data.values, qs, interpolation)

     if (
         is_cupy_type(data)

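As background (my illustration, not code from the PR): dask's `percentile_lookup` is a type-based dispatch table, so registering a handler for `pd.api.extensions.ExtensionArray` routes all extension arrays through it automatically. The pattern can be sketched with the standard library's `functools.singledispatch`; the `percentile` function and handlers below are simplified stand-ins, not dask's actual implementation:

```python
from functools import singledispatch

import numpy as np
import pandas as pd


@singledispatch
def percentile(a, q):
    # Default path: plain NumPy arrays
    return np.percentile(a, q)


@percentile.register(pd.api.extensions.ExtensionArray)
def _percentile_extension_array(a, q):
    # Convert the extension array to a concrete NumPy array first,
    # mirroring the to_numpy() approach in the diff above
    return np.percentile(a.to_numpy(dtype="float64", na_value=np.nan), q)


arr = pd.array([1, 2, 3, 4, 5], dtype="Int64")
print(percentile(arr, 50))  # dispatches on type(arr)
```

Because `IntegerArray` subclasses `ExtensionArray`, dispatch resolves through the class hierarchy, so one registration covers every pandas extension type.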
Collaborator

Is this discussion resolved, or do you want to try out the dispatch-based approach here?

I'd worry a bit about using to_numpy() naively in all cases, as that would create object dtype arrays in places where pandas-optimized routines would probably be better.
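To illustrate that concern (my example, not from the PR): a naive `to_numpy()` on a nullable extension array produces an `object`-dtype array, because the result must be able to hold `pd.NA`; supplying an explicit dtype and `na_value` keeps a fast numeric array instead:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Naive conversion falls back to object dtype to accommodate pd.NA
naive = s.to_numpy()
print(naive.dtype)  # object

# Requesting a concrete dtype yields a proper numeric ndarray
numeric = s.to_numpy(dtype="float64", na_value=np.nan)
print(numeric.dtype)  # float64
```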

Member Author

I believe it's been resolved. I'm happy with the current set of changes -- was just throwing @percentile_lookup.register(pd.api.extensions.ExtensionArray) out there as a possible thing we could do

[1567703790155681, 0],
[1567703791155681, 1],
[1567703792155681, 2],
[1567703790155681, 0],
Collaborator

This is odd -- it seems that datetimes are not being properly sorted for `us` (microsecond) units? Is there an upstream pandas issue, or is it on the dask side?

Member Author
@jrbourbeau jrbourbeau Oct 15, 2022

Okay, so after digging in here, it appears this behavior is due to a discrepancy between np.percentile and Series.quantile when interpolation="linear", which is, I believe, a bug in pandas. I've filed an upstream issue here: pandas-dev/pandas#49110.

As a workaround, I've updated the interpolation method to be "nearest" for datetime data, which doesn't have the discrepancy (i.e. Series.quantile == np.percentile). We were already using "nearest" for timezone-aware datetimes and have switched from "linear" to "nearest" interpolation in the past when using "linear" has been problematic. I've reverted the changes to test_set_index_datetime_precision, which now passes (along with the rest of the test suite).
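A quick sketch of why "nearest" sidesteps the discrepancy (my illustration, not code from the PR): unlike "linear", it always returns an actual element of the data, so there is no interpolated in-between timestamp for the two implementations to disagree on:

```python
import pandas as pd

# Microsecond-precision timestamps, like those in the test data above
s = pd.Series(
    pd.to_datetime(
        [1567703790155681, 1567703791155681, 1567703792155681], unit="us"
    )
)

# "linear" would interpolate between elements at q=0.4;
# "nearest" instead returns an existing timestamp from the data
q = s.quantile(0.4, interpolation="nearest")
assert q in set(s)
print(q)
```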

@jrbourbeau
Member Author

Alright, I think this PR is good to go (or at least ready for another review). @ian-r-rose would you mind taking a look whenever you get a chance?

Collaborator
@ian-r-rose ian-r-rose left a comment

Thanks @jrbourbeau, this looks good to me! Just one non-blocking comment/question.


@jrbourbeau jrbourbeau merged commit 43fede5 into dask:main Oct 18, 2022
@jrbourbeau jrbourbeau deleted the set_index-extension-dtypes branch October 18, 2022 20:39