Add extension dtype support to set_index (#9566)
Conversation
Woo 🎉
jrbourbeau
left a comment
Alright, this is probably ready for some eyes at this point. @ian-r-rose I'd be curious what your thoughts are whenever you get a few minutes
```python
try:
    vals = data.quantile(q=qs / 100, interpolation=interpolation)
except TypeError:
    vals, _ = _percentile(array_safe(data, data.dtype), qs, interpolation)
```
gpuCI isn't happy about this array_safe casting, but I've got a feeling this will be a straightforward fix
What is the goal of array_safe here? My understanding is that data is a pd/cudf.DataFrame here. Is the problem that data.values sometimes fails?
I don't think .values is quite what we want here as (at least on the pandas side) when extension dtypes are involved, .values returns a pandas extension array instead of a numpy ndarray
```python
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

In [3]: df = df.astype({"a": "Int64"})

In [4]: df.b.values
Out[4]: array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [5]: df.a.values
Out[5]:
<IntegerArray>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Length: 10, dtype: Int64
```

I'd be curious to see what the equivalent cudf code looks like for the above example ^
You bring up a good point though. Depending on what cudf is doing, an alternative approach to the array_safe casting could be to teach _percentile how to handle extension arrays
```diff
diff --git a/dask/dataframe/backends.py b/dask/dataframe/backends.py
index f26b5603f..25c46d762 100644
--- a/dask/dataframe/backends.py
+++ b/dask/dataframe/backends.py
@@ -559,6 +559,11 @@ def percentile(a, q, interpolation="linear"):
     return _percentile(a, q, interpolation)
 
 
+@percentile_lookup.register(pd.api.extensions.ExtensionArray)
+def _percentile_pandas_extension_array(a, q, interpolation="linear"):
+    return _percentile(a.to_numpy(), q, interpolation)
+
+
 ######################################
 # cuDF: Pandas Dataframes on the GPU #
 ######################################
diff --git a/dask/dataframe/partitionquantiles.py b/dask/dataframe/partitionquantiles.py
index 128405a2d..47c4076d4 100644
--- a/dask/dataframe/partitionquantiles.py
+++ b/dask/dataframe/partitionquantiles.py
@@ -423,7 +423,7 @@ def percentiles_summary(df, num_old, num_new, upsample, state):
     try:
         vals = data.quantile(q=qs / 100, interpolation=interpolation).values
     except (TypeError, NotImplementedError):
-        vals, _ = _percentile(array_safe(data, data.dtype), qs, interpolation)
+        vals, _ = _percentile(data.values, qs, interpolation)
     if (
         is_cupy_type(data)
```
Is this discussion resolved, or do you want to try out the dispatch-based approach here?
I'd worry a bit about using to_numpy() naively in all cases, as that would create object dtype arrays in places where pandas-optimized routines would probably be better.
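The object-dtype concern is easy to demonstrate — a minimal sketch, assuming a NA-free nullable-integer array:

```python
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int64")

# Without a target dtype, to_numpy() produces an object-dtype ndarray
# so that pd.NA could be represented losslessly.
print(arr.to_numpy().dtype)               # object

# With an explicit dtype (safe here since there are no NAs), we get a
# proper numeric ndarray that fast numeric routines can use.
print(arr.to_numpy(dtype="int64").dtype)  # int64
```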
I believe it's been resolved. I'm happy with the current set of changes -- was just throwing @percentile_lookup.register(pd.api.extensions.ExtensionArray) out there as a possible thing we could do
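For context, percentile_lookup is a type-based dispatcher, so a registration like the one above routes pandas extension arrays to a dedicated handler. A rough sketch of the idea using functools.singledispatch in place of Dask's own Dispatch class (the percentile function below is illustrative, not Dask's implementation):

```python
from functools import singledispatch

import numpy as np
import pandas as pd

@singledispatch
def percentile(a, q):
    # Default path: treat the input as a plain ndarray-like.
    return np.percentile(np.asarray(a), q)

# Analogous to @percentile_lookup.register(pd.api.extensions.ExtensionArray):
# subclasses such as IntegerArray dispatch here automatically.
@percentile.register(pd.api.extensions.ExtensionArray)
def _(a, q):
    return np.percentile(a.to_numpy(dtype="float64"), q)

print(percentile(np.arange(10), 50))                       # 4.5
print(percentile(pd.array(range(10), dtype="Int64"), 50))  # 4.5
```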
```python
[1567703790155681, 0],
[1567703791155681, 1],
[1567703792155681, 2],
[1567703790155681, 0],
```
This is odd -- it seems that datetimes with microsecond ("us") units are not being properly sorted? Is there an upstream pandas issue, or is it on the dask side?
Okay, so after digging in here it appears this behavior is due to a discrepancy between np.percentile and Series.quantile when interpolation="linear", which I believe is a bug in pandas. I've filed an upstream issue here: pandas-dev/pandas#49110.
As a workaround, I've updated the interpolation method to be "nearest" for datetime data, which doesn't have the discrepancy (i.e. Series.quantile == np.percentile). We were already using "nearest" for timezone-aware datetimes and have switched from "linear" to "nearest" interpolation in the past when using "linear" has been problematic. I've reverted the changes to test_set_index_datetime_precision, which now passes (along with the rest of the test suite).
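The behavioral difference between the two interpolation methods can be sketched directly with NumPy (the method= keyword requires NumPy >= 1.22; the timestamp values below mirror the microsecond-precision test data above):

```python
import numpy as np

# Microsecond-precision timestamps as int64, one second apart.
ts = np.array([1567703790155681, 1567703791155681, 1567703792155681], dtype="int64")

# "nearest" always picks an actual element of the input, so no lossy
# float round-trip is involved.
print(int(np.percentile(ts, 40, method="nearest")))   # 1567703791155681

# "linear" interpolates between elements through float64, producing
# a value that does not appear in the data.
print(float(np.percentile(ts, 40, method="linear")))  # 1567703790955681.0
```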
Alright, I think this PR is good to go (or at least ready for another review). @ian-r-rose would you mind taking a look whenever you get a chance?
ian-r-rose
left a comment
Thanks @jrbourbeau, this looks good to me! Just one non-blocking comment/question.
```python
try:
    vals = data.quantile(q=qs / 100, interpolation=interpolation)
except TypeError:
    vals, _ = _percentile(array_safe(data, data.dtype), qs, interpolation)
```
Is this discussion resolved, or do you want to try out the dispatch-based approach here?
I'd worry a bit about using to_numpy() naively in all cases, as that would create object dtype arrays in places where pandas-optimized routines would probably be better.
Still very much a WIP, but wanted to push up for early feedback
Closes #5720, closes #8011, closes #9118