Fix `percentiles_summary` with `dask_cudf` (#7325)
Conversation
```diff
          data = data.codes
          interpolation = "nearest"
-     elif np.issubdtype(data.dtype, np.integer):
+     elif np.issubdtype(data.dtype, np.integer) and not is_cupy_type(data):
```
Thanks @pentschev! Do you think it's worth adding a comment here saying that the "nearest" option is not supported for CuPy?
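For readers following along, the branch being changed picks the percentile interpolation mode from the input's dtype. A minimal NumPy-only sketch of that dispatch (`is_cupy_type` is a dask internal; a module-name check stands in for it here so the snippet runs without CuPy):

```python
import numpy as np

def choose_interpolation(data):
    """Pick the interpolation mode the way the patched branch does:
    non-CuPy integer data uses "nearest" so results keep the input
    dtype; everything else falls back to "linear".
    """
    # Stand-in for dask's is_cupy_type(); only NumPy arrays here.
    is_cupy = type(data).__module__.split(".")[0] == "cupy"
    if np.issubdtype(data.dtype, np.integer) and not is_cupy:
        return "nearest"
    return "linear"

ints = np.array([1, 2, 3, 10], dtype="int64")
floats = np.array([1.0, 2.5, 3.0])
print(choose_interpolation(ints))    # nearest
print(choose_interpolation(floats))  # linear
```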
Thanks Peter!
The Dask PR removed the following rounding step after the `_percentile` call:

```python
if interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
    vals = np.round(vals).astype(data.dtype)
```

Do we still need this if the data is CuPy?
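For reference, the removed step can be exercised standalone with NumPy: linear interpolation promotes integer input to float64, and the rounding cast restored the original dtype. The `data` and `qs` values below are illustrative stand-ins for what flows through `percentiles_summary`:

```python
import numpy as np

data = np.array([0, 1, 2, 3, 100], dtype="int64")
qs = [0, 25, 50, 75, 100]

vals = np.percentile(data, qs)  # default interpolation is linear
print(vals.dtype)               # float64: linear interpolation promotes ints

# The rounding step the Dask PR removed:
interpolation = "linear"
if interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
    vals = np.round(vals).astype(data.dtype)

print(vals.dtype)  # int64
```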
I'm not sure about that. What I can say is that the dask_cudf tests that were failing because of it in tonight's run (see https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=10.1,GPU_LABEL=gpu,OS=ubuntu16.04,PYTHON=3.7/1184/#showFailuresLink) pass for me locally. I'm happy to add that step back if someone is confident about the correct usage here.
Got it - makes perfect sense to use the tests as a guide. My worry is that we may run into other problems: we were previously ensuring that divisions had the same dtype as the data, and now they will always be float.

Maybe, while we are already making a change, we can fix the original issue for CuPy data and set the 0 and 100 percentiles to the actual min/max values?
To follow up on my previous comment, it may be best to do this after the `_percentile` call:

```python
if is_cupy_type(data) and interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
    vals = np.round(vals).astype(data.dtype)

if qs[0] == 0:
    # Ensure the 0th quantile is the minimum value of the data
    vals[0] = data.min()
```
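A minimal runnable sketch of that proposed post-processing, under two stated assumptions: the `is_cupy_type(data)` guard is dropped so the snippet runs with plain NumPy, and `np.percentile` stands in for dask's internal `_percentile`:

```python
import numpy as np

def fixup_percentiles(vals, qs, data, interpolation="linear"):
    # Round linear-interpolated percentiles of integer data back to
    # the input dtype. (The proposal above also guards this branch on
    # is_cupy_type(data), a dask internal omitted in this sketch.)
    if interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
        vals = np.round(vals).astype(data.dtype)
    if qs[0] == 0:
        # Ensure the 0th quantile is the minimum value of the data
        vals[0] = data.min()
    return vals

data = np.array([3, 7, 1, 9], dtype="int64")
qs = [0, 50, 100]
vals = fixup_percentiles(np.percentile(data, qs), qs, data)
print(vals.dtype)             # int64
print(vals[0] == data.min())  # True
```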
Thanks again @pentschev. I'll merge when tests pass.

cc @jrbourbeau

Thanks for the reminder @jakirkham. I'll wait for @jrbourbeau to sign off (James, you can also merge if you are good with this).

Looks like we need to run `black dask` / `flake8 dask`.

I took the liberty of running those here.

Not at all, thanks @jrbourbeau!
jrbourbeau left a comment
Thanks all! Let's merge after CI passes
Thanks Peter! Also thanks everyone for the reviews! 😄
* upstream/master: (43 commits)
  - bump version to 2021.03.0
  - Bump minimum version of distributed (dask#7328)
  - Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  - Temporarily revert recent Array.__setitem__ updates (dask#7326)
  - Blockwise.clone (dask#7312)
  - NEP-35 duck array update (dask#7321)
  - Don't allow setting `.name` for array (dask#7222)
  - Use nearest interpolation for creating percentiles of integer input (dask#7305)
  - Test `exp` with CuPy arrays (dask#7322)
  - Check that computed chunks have right size and dtype (dask#7277)
  - pytest.mark.flaky (dask#7319)
  - Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  - Support for Python 3.9 (dask#7289)
  - Add broadcast-based merge implementation (dask#7143)
  - Add split_every to graph_manipulation (dask#7282)
  - Typo in optimize docs (dask#7306)
  - dask.graph_manipulation support for xarray.Dataset (dask#7276)
  - Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  - Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  - Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  - ...