Skip to content

Fix percentiles_summary with dask_cudf#7325

Merged
jakirkham merged 9 commits intodask:masterfrom
pentschev:fix-cupy-percentiles_summary
Mar 5, 2021
Merged

Fix percentiles_summary with dask_cudf#7325
jakirkham merged 9 commits intodask:masterfrom
pentschev:fix-cupy-percentiles_summary

Conversation

@pentschev
Copy link
Member

@pentschev pentschev commented Mar 5, 2021

data = data.codes
interpolation = "nearest"
elif np.issubdtype(data.dtype, np.integer):
elif np.issubdtype(data.dtype, np.integer) and not is_cupy_type(data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pentschev! Do you think it's worth adding a comment saying that "nearest" option is not supported for cupy here ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0124273

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Peter!

The Dask PR removed the following rounding step after the _percentiles call:

    if interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
        vals = np.round(vals).astype(data.dtype)

Do we still need this if the data is cupy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about it, what I can say is that dask_cudf tests that were failing because of that in tonight's tests (see https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=10.1,GPU_LABEL=gpu,OS=ubuntu16.04,PYTHON=3.7/1184/#showFailuresLink) pass for me locally. I'm happy to add that back if someone is confident on the correct usage here.

Copy link
Member

@rjzamora rjzamora Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - Makes perfect sense to use tests as a guide. My worry is that we may run into other problems if we were previously ensuring that divisions were the same dtype as the data, and now it will always be float.

Maybe, while we are already making a change, we can fix the origial issue for cupy data, and set the 0 and 100 percentiles to the actual min/max values?

Copy link
Member

@rjzamora rjzamora Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow up on my previous comment. It may be best to do this after the _percentile call:

if is_cupy_type(data) and interpolation == "linear" and np.issubdtype(data.dtype, np.integer):
    vals = np.round(vals).astype(data.dtype)
    if qs[0] == 0:
        # Ensure the 0th quantile is the minimum value of the data
        vals[0] = data.min()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora for the suggestion, I can confirm locally dask_cudf tests also pass with the change above, so I think it's probably good to do it. I pushed 3a9fdf5 with the changes you suggested now.

@quasiben
Copy link
Member

quasiben commented Mar 5, 2021

Thanks again @pentschev . I'll merge when tests pass

@jakirkham
Copy link
Member

cc @jrbourbeau

@quasiben
Copy link
Member

quasiben commented Mar 5, 2021

Thanks for the reminder @jakirkham . I'll wait for @jrbourbeau to sign-off (james, you can also merge if you are good with this)

@jakirkham
Copy link
Member

Looks like we need to run black

@jrbourbeau
Copy link
Member

I took the liberty of running black (hope you don't mind Peter)

@pentschev
Copy link
Member Author

Not at all, thanks @jrbourbeau !

@jakirkham jakirkham requested a review from rjzamora March 5, 2021 17:01
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all! Let's merge after CI passes

@jakirkham jakirkham merged commit e46a050 into dask:master Mar 5, 2021
@jakirkham
Copy link
Member

Thanks Peter! Also thanks everyone for the reviews! 😄

dcherian added a commit to dcherian/dask that referenced this pull request Mar 8, 2021
* upstream/master: (43 commits)
  bump version to 2021.03.0
  Bump minimum version of distributed (dask#7328)
  Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  Temporarily revert recent Array.__setitem__ updates (dask#7326)
  Blockwise.clone (dask#7312)
  NEP-35 duck array update (dask#7321)
  Don't allow setting `.name` for array (dask#7222)
  Use nearest interpolation for creating percentiles of integer input (dask#7305)
  Test `exp` with CuPy arrays (dask#7322)
  Check that computed chunks have right size and dtype (dask#7277)
  pytest.mark.flaky (dask#7319)
  Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  Support for Python 3.9 (dask#7289)
  Add broadcast-based merge implementation (dask#7143)
  Add split_every to graph_manipulation (dask#7282)
  Typo in optimize docs (dask#7306)
  dask.graph_manipulation support for xarray.Dataset (dask#7276)
  Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  ...
@pentschev pentschev deleted the fix-cupy-percentiles_summary branch June 30, 2021 12:27
pentschev added a commit to pentschev/dask that referenced this pull request Aug 3, 2021
In dask#7325 one of the assertions was left
incorrect, and because Dask didn't have gpuCI back then it wasn't
catched before. This fixes cuDF assertion back to what it was before
dask#7305, as cuDF still has to use
"linear" instead of "nearest".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants