Fix `percentiles_summary` with `dask_cudf` (#7325)
Conversation
```diff
          data = data.codes
          interpolation = "nearest"
-     elif np.issubdtype(data.dtype, np.integer):
+     elif np.issubdtype(data.dtype, np.integer) and not is_cupy_type(data):
```
Thanks @pentschev! Do you think it's worth adding a comment here saying that the "nearest" option is not supported for CuPy?
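For readers following along, the branch being changed picks the percentile interpolation mode from the input's dtype. A minimal NumPy-only sketch of that dispatch (`is_cupy_type` is a dask internal; a module-name check stands in for it here so the snippet runs without CuPy):

```python
import numpy as np

def choose_interpolation(data):
    """Pick the interpolation mode the way the patched branch does:
    non-CuPy integer data uses "nearest" so results keep the input
    dtype; everything else falls back to "linear".
    """
    # Stand-in for dask's is_cupy_type(); only NumPy arrays here.
    is_cupy = type(data).__module__.split(".")[0] == "cupy"
    if np.issubdtype(data.dtype, np.integer) and not is_cupy:
        return "nearest"
    return "linear"

ints = np.array([1, 2, 3, 10], dtype="int64")
floats = np.array([1.0, 2.5, 3.0])
print(choose_interpolation(ints))    # nearest
print(choose_interpolation(floats))  # linear
```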
Thanks Peter!
The Dask PR removed the following rounding step after the `_percentile` call:

```python
if interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
    vals = np.round(vals).astype(data.dtype)
```

Do we still need this if the data is CuPy?
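For reference, the removed step can be exercised standalone with NumPy: linear interpolation promotes integer input to float64, and the rounding cast restored the original dtype. The `data` and `qs` values below are illustrative stand-ins for what flows through `percentiles_summary`:

```python
import numpy as np

data = np.array([0, 1, 2, 3, 100], dtype="int64")
qs = [0, 25, 50, 75, 100]

vals = np.percentile(data, qs)  # default interpolation is linear
print(vals.dtype)               # float64: linear interpolation promotes ints

# The rounding step the Dask PR removed:
interpolation = "linear"
if interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
    vals = np.round(vals).astype(data.dtype)

print(vals.dtype)  # int64
```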
I'm not sure about that. What I can say is that the dask_cudf tests that were failing because of it in tonight's run (see https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=10.1,GPU_LABEL=gpu,OS=ubuntu16.04,PYTHON=3.7/1184/#showFailuresLink) pass for me locally. I'm happy to add that step back if someone is confident about the correct usage here.
Got it - makes perfect sense to use the tests as a guide. My worry is that we may run into other problems: we were previously ensuring that divisions had the same dtype as the data, and now they will always be float.

Maybe, while we are already making a change, we can fix the original issue for CuPy data and set the 0 and 100 percentiles to the actual min/max values?
To follow up on my previous comment, it may be best to do this after the `_percentile` call:

```python
if is_cupy_type(data) and interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
    vals = np.round(vals).astype(data.dtype)

if qs[0] == 0:
    # Ensure the 0th quantile is the minimum value of the data
    vals[0] = data.min()
```
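A minimal runnable sketch of that proposed post-processing, under two stated assumptions: the `is_cupy_type(data)` guard is dropped so the snippet runs with plain NumPy, and `np.percentile` stands in for dask's internal `_percentile`:

```python
import numpy as np

def fixup_percentiles(vals, qs, data, interpolation="linear"):
    # Round linear-interpolated percentiles of integer data back to
    # the input dtype. (The proposal above also guards this branch on
    # is_cupy_type(data), a dask internal omitted in this sketch.)
    if interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
        vals = np.round(vals).astype(data.dtype)
    if qs[0] == 0:
        # Ensure the 0th quantile is the minimum value of the data
        vals[0] = data.min()
    return vals

data = np.array([3, 7, 1, 9], dtype="int64")
qs = [0, 50, 100]
vals = fixup_percentiles(np.percentile(data, qs), qs, data)
print(vals.dtype)             # int64
print(vals[0] == data.min())  # True
```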
Thanks again @pentschev. I'll merge when tests pass.

cc @jrbourbeau

Thanks for the reminder @jakirkham. I'll wait for @jrbourbeau to sign off (James, you can also merge if you are good with this).

Looks like we need to run `black dask` / `flake8 dask`.

I took the liberty of running those here.

Not at all, thanks @jrbourbeau!
jrbourbeau left a comment
Thanks all! Let's merge after CI passes
Thanks Peter! Also thanks everyone for the reviews! 😄
* upstream/master: (43 commits)
  - bump version to 2021.03.0
  - Bump minimum version of distributed (dask#7328)
  - Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  - Temporarily revert recent Array.__setitem__ updates (dask#7326)
  - Blockwise.clone (dask#7312)
  - NEP-35 duck array update (dask#7321)
  - Don't allow setting `.name` for array (dask#7222)
  - Use nearest interpolation for creating percentiles of integer input (dask#7305)
  - Test `exp` with CuPy arrays (dask#7322)
  - Check that computed chunks have right size and dtype (dask#7277)
  - pytest.mark.flaky (dask#7319)
  - Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  - Support for Python 3.9 (dask#7289)
  - Add broadcast-based merge implementation (dask#7143)
  - Add split_every to graph_manipulation (dask#7282)
  - Typo in optimize docs (dask#7306)
  - dask.graph_manipulation support for xarray.Dataset (dask#7276)
  - Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  - Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  - Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  - ...