Add extension dtype support to set_index#9566

Merged
jrbourbeau merged 14 commits into dask:main from jrbourbeau:set_index-extension-dtypes
Oct 18, 2022
Conversation

@jrbourbeau
Member

Still very much a WIP, but wanted to push up for early feedback

Closes #5720, closes #8011, closes #9118

@mrocklin
Member

Woo 🎉

@jrbourbeau jrbourbeau changed the title from "[WIP] Add extension dtype support to set_index" to "Add extension dtype support to set_index" on Oct 12, 2022
Member Author
@jrbourbeau jrbourbeau left a comment

Alright, this is probably ready for some eyes at this point. @ian-r-rose I'd be curious what your thoughts are whenever you get a few minutes

try:
vals = data.quantile(q=qs / 100, interpolation=interpolation)
except TypeError:
vals, _ = _percentile(array_safe(data, data.dtype), qs, interpolation)
Member Author

gpuCI isn't happy about this array_safe casting, but I've got a feeling this will be a straightforward fix

Member

What is the goal of array_safe here? My understanding is that data is a pd/cudf.DataFrame here. Is the problem that data.values sometimes fails?

Member Author

I don't think `.values` is quite what we want here: at least on the pandas side, when extension dtypes are involved, `.values` returns a pandas extension array instead of a NumPy ndarray.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

In [3]: df = df.astype({"a": "Int64"})

In [4]: df.b.values
Out[4]: array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [5]: df.a.values
Out[5]:
<IntegerArray>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Length: 10, dtype: Int64

I'd be curious to see what the equivalent cudf code looks like for the above example ^

You bring up a good point though. Depending on what cudf is doing, an alternative approach to the array_safe casting could be to teach _percentile how to handle extension arrays

diff --git a/dask/dataframe/backends.py b/dask/dataframe/backends.py
index f26b5603f..25c46d762 100644
--- a/dask/dataframe/backends.py
+++ b/dask/dataframe/backends.py
@@ -559,6 +559,11 @@ def percentile(a, q, interpolation="linear"):
     return _percentile(a, q, interpolation)


+@percentile_lookup.register(pd.api.extensions.ExtensionArray)
+def _percentile_pandas_extension_array(a, q, interpolation="linear"):
+    return _percentile(a.to_numpy(), q, interpolation)
+
+
 ######################################
 # cuDF: Pandas Dataframes on the GPU #
 ######################################
diff --git a/dask/dataframe/partitionquantiles.py b/dask/dataframe/partitionquantiles.py
index 128405a2d..47c4076d4 100644
--- a/dask/dataframe/partitionquantiles.py
+++ b/dask/dataframe/partitionquantiles.py
@@ -423,7 +423,7 @@ def percentiles_summary(df, num_old, num_new, upsample, state):
     try:
         vals = data.quantile(q=qs / 100, interpolation=interpolation).values
     except (TypeError, NotImplementedError):
-        vals, _ = _percentile(array_safe(data, data.dtype), qs, interpolation)
+        vals, _ = _percentile(data.values, qs, interpolation)

     if (
         is_cupy_type(data)

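As background (my illustration, not code from the PR): dask's `percentile_lookup` is a type-based dispatch table, so registering a handler for `pd.api.extensions.ExtensionArray` routes all extension arrays through it automatically. The pattern can be sketched with the standard library's `functools.singledispatch`; the `percentile` function and handlers below are simplified stand-ins, not dask's actual implementation:

```python
from functools import singledispatch

import numpy as np
import pandas as pd


@singledispatch
def percentile(a, q):
    # Default path: plain NumPy arrays
    return np.percentile(a, q)


@percentile.register(pd.api.extensions.ExtensionArray)
def _percentile_extension_array(a, q):
    # Convert the extension array to a concrete NumPy array first,
    # mirroring the to_numpy() approach in the diff above
    return np.percentile(a.to_numpy(dtype="float64", na_value=np.nan), q)


arr = pd.array([1, 2, 3, 4, 5], dtype="Int64")
print(percentile(arr, 50))  # dispatches on type(arr)
```

Because `IntegerArray` subclasses `ExtensionArray`, dispatch resolves through the class hierarchy, so one registration covers every pandas extension type.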
Collaborator

Is this discussion resolved, or do you want to try out the dispatch-based approach here?

I'd worry a bit about using to_numpy() naively in all cases, as that would create object dtype arrays in places where pandas-optimized routines would probably be better.
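To illustrate that concern (my example, not from the PR): a naive `to_numpy()` on a nullable extension array produces an `object`-dtype array, because the result must be able to hold `pd.NA`; supplying an explicit dtype and `na_value` keeps a fast numeric array instead:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Naive conversion falls back to object dtype to accommodate pd.NA
naive = s.to_numpy()
print(naive.dtype)  # object

# Requesting a concrete dtype yields a proper numeric ndarray
numeric = s.to_numpy(dtype="float64", na_value=np.nan)
print(numeric.dtype)  # float64
```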

Member Author

I believe it's been resolved. I'm happy with the current set of changes -- was just throwing @percentile_lookup.register(pd.api.extensions.ExtensionArray) out there as a possible thing we could do

[1567703790155681, 0],
[1567703791155681, 1],
[1567703792155681, 2],
[1567703790155681, 0],
Collaborator

This is odd -- it seems that datetimes are not being properly sorted for `us` (microsecond) units? Is there an upstream pandas issue, or is it on the dask side?

Member Author
@jrbourbeau jrbourbeau Oct 15, 2022

Okay, so after digging in here, it appears this behavior is due to a discrepancy between np.percentile and Series.quantile when interpolation="linear", which is, I believe, a bug in pandas. I've filed an upstream issue here: pandas-dev/pandas#49110.

As a workaround, I've updated the interpolation method to be "nearest" for datetime data, which doesn't have the discrepancy (i.e. Series.quantile == np.percentile). We were already using "nearest" for timezone-aware datetimes and have switched from "linear" to "nearest" interpolation in the past when using "linear" has been problematic. I've reverted the changes to test_set_index_datetime_precision, which now passes (along with the rest of the test suite).
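A quick sketch of why "nearest" sidesteps the discrepancy (my illustration, not code from the PR): unlike "linear", it always returns an actual element of the data, so there is no interpolated in-between timestamp for the two implementations to disagree on:

```python
import pandas as pd

# Microsecond-precision timestamps, like those in the test data above
s = pd.Series(
    pd.to_datetime(
        [1567703790155681, 1567703791155681, 1567703792155681], unit="us"
    )
)

# "linear" would interpolate between elements at q=0.4;
# "nearest" instead returns an existing timestamp from the data
q = s.quantile(0.4, interpolation="nearest")
assert q in set(s)
print(q)
```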

@jrbourbeau
Member Author

Alright, I think this PR is good to go (or at least ready for another review). @ian-r-rose would you mind taking a look whenever you get a chance?

Collaborator
@ian-r-rose ian-r-rose left a comment

Thanks @jrbourbeau, this looks good to me! Just one non-blocking comment/question.


@jrbourbeau jrbourbeau merged commit 43fede5 into dask:main Oct 18, 2022
@jrbourbeau jrbourbeau deleted the set_index-extension-dtypes branch October 18, 2022 20:39