Use nearest interpolation for creating percentiles of integer input by kylebarron · Pull Request #7305 · dask/dask

kylebarron · 2021-03-02T23:22:33Z

Closes partition_quantiles finds incorrect minimum with large unsigned integers #7304
Tests added / passed
Passes black dask / flake8 dask
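
As a rough illustration of why linear interpolation can misreport the minimum of large unsigned integers, the sketch below compares the two interpolation modes in plain Python. It is not Dask's `partition_quantiles` code: `linear_percentile` and `nearest_percentile` are hypothetical helpers, and the assumed mechanism is that linear interpolation goes through 64-bit floats, which cannot represent every integer near 2**64.

```python
def linear_percentile(sorted_vals, q):
    """Percentile q (0-100) with linear interpolation, computed in floats."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    # The float conversion is where large integers lose precision.
    return float(sorted_vals[lo]) * (1.0 - frac) + float(sorted_vals[hi]) * frac


def nearest_percentile(sorted_vals, q):
    """Percentile q (0-100) that returns an element of the input unchanged."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    return sorted_vals[int(round(pos))]


UINT64_MAX = 2**64 - 1
values = [UINT64_MAX - 2, UINT64_MAX - 1, UINT64_MAX]  # hypothetical data

linear_min = int(linear_percentile(values, 0))   # rounds up to 2**64
nearest_min = nearest_percentile(values, 0)      # exactly values[0]

print(linear_min > UINT64_MAX)    # True: no longer fits in uint64
print(nearest_min == values[0])   # True
```

With "nearest", the result is always one of the input values, so no float round trip is involved.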

kylebarron · 2021-03-02T23:44:10Z

I installed this branch with `pip install -e .`, and the change seems to fix my issue from #7304 with the synchronous scheduler. I don't see a change when using the distributed scheduler, even after installing that from source, but I'm guessing I did something wrong setting that up. (I used the IPython instructions.)

Edit: Seems to be working with distributed now as well.

eriknw · 2021-03-03T04:06:53Z

Thanks @kylebarron, this looks great to me!

kylebarron · 2021-03-03T21:37:52Z

dask/dataframe/tests/test_shuffle.py

    d1 = d.set_index("x", npartitions=3)
    assert d1.npartitions == 3
-    assert set(d1.divisions) == set([1, 2, 3, 4])
+    assert set(d1.divisions) == set([1, 2, 4])


@eriknw do you think this is an issue?

Nope. This test is contrived and the behavior doesn't reflect a real concern.

I think it would be very hard to contrive a pathological example where this change would make a noticeable impact (and if I'm wrong about this, then I would be very interested in learning where it does make an impact!). The likely worst case scenario is other projects will need to update a couple tests that depended on the old behavior (but even this may be a stretch).
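
To see how the interpolation mode can shift division boundaries on small integer data, here is a minimal sketch in plain Python. The data and the `percentile` helper are hypothetical, and it does not reproduce the test above, since Dask merges per-partition summaries before choosing divisions. It only shows that "nearest" returns values present in the data, which can repeat and collapse after deduplication, whereas "linear" can produce values between data points.

```python
def percentile(sorted_vals, q, interpolation):
    """Percentile q (0-100) of pre-sorted values; hypothetical helper."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    if interpolation == "nearest":
        return sorted_vals[int(round(pos))]
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


data = sorted([4, 1, 1, 3, 3])          # hypothetical integer column
qs = [0, 100 / 3, 200 / 3, 100]         # boundaries for 3 partitions

linear_cuts = [percentile(data, q, "linear") for q in qs]
nearest_cuts = [percentile(data, q, "nearest") for q in qs]

print(linear_cuts)                # includes a value near 1.67, not in the data
print(nearest_cuts)               # [1, 1, 3, 4]
print(sorted(set(nearest_cuts)))  # [1, 3, 4]
```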

Everything LGTM.

I currently don't have write permission to this repo, so somebody else will need to merge.

jsignell · 2021-03-04T21:51:14Z

Thank you @kylebarron for opening this and @eriknw for reviewing!

rjzamora · 2021-03-05T03:43:14Z

This PR seems to be causing CI failures in dask-cuda cc @jakirkham (no failures before this commit)

jakirkham · 2021-03-05T03:51:37Z

This PR seems to be causing CI failures in dask-cuda cc @jakirkham (no failures before this commit)

cc @jrbourbeau (for vis)

rjzamora · 2021-03-05T03:52:52Z

Actually - It looks like this PR is completely breaking dask_cudf, because the "nearest" option is not supported for cupy percentile. I don't think we can choose "nearest" as a default here.

jakirkham · 2021-03-05T03:57:11Z

Thanks Rick! Also mentioned this over on the release issue ( dask/community#129 ). Would changing the default be sufficient for addressing the issue?

rjzamora · 2021-03-05T04:47:37Z

Would changing the default be sufficient for addressing the issue?

It seems that the whole point of this PR was to change the default for integer types to be “nearest” (to fix some correctness issues). I’ll need to investigate tomorrow morning if we can fix the original issue without breaking cupy.
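
One way to keep "nearest" for integer input without breaking backends that lack it is to choose the mode from both the dtype and the backend's capabilities. The sketch below is an assumption about what such a selection could look like, not the code from this PR or any follow-up fix; `choose_interpolation` and its `backend_supports_nearest` flag are hypothetical.

```python
def choose_interpolation(dtype_kind, backend_supports_nearest):
    """Pick a percentile interpolation mode.

    dtype_kind follows NumPy's one-letter dtype kinds: "i" for signed
    integers, "u" for unsigned integers, "f" for floats, and so on.
    backend_supports_nearest is a hypothetical capability flag.
    """
    is_integer = dtype_kind in ("i", "u")
    if is_integer and backend_supports_nearest:
        # Integer input: return actual data values, avoiding float round trips.
        return "nearest"
    # Non-integer input, or a backend without "nearest": fall back to "linear".
    return "linear"


print(choose_interpolation("u", True))    # nearest
print(choose_interpolation("u", False))   # linear
print(choose_interpolation("f", True))    # linear
```

The fallback trades the integer-exactness fix for compatibility on backends without "nearest".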

pentschev · 2021-03-05T13:51:33Z

I'm attempting to fix the issue mentioned above in #7325 .

jsignell · 2021-03-05T15:28:57Z

Oof sorry everyone! It didn't occur to me to check with cupy folks.

pentschev · 2021-03-05T18:39:40Z

Oof sorry everyone! It didn't occur to me to check with cupy folks.

Yeah, we need better testing in Dask. I think @quasiben is actively looking for HW so that CuPy tests can run during PRs too, although I think this wouldn't have been caught anyway, as I don't think there were tests covering this. 🙂

* upstream/master: (43 commits)
  bump version to 2021.03.0
  Bump minimum version of distributed (dask#7328)
  Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  Temporarily revert recent Array.__setitem__ updates (dask#7326)
  Blockwise.clone (dask#7312)
  NEP-35 duck array update (dask#7321)
  Don't allow setting `.name` for array (dask#7222)
  Use nearest interpolation for creating percentiles of integer input (dask#7305)
  Test `exp` with CuPy arrays (dask#7322)
  Check that computed chunks have right size and dtype (dask#7277)
  pytest.mark.flaky (dask#7319)
  Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  Support for Python 3.9 (dask#7289)
  Add broadcast-based merge implementation (dask#7143)
  Add split_every to graph_manipulation (dask#7282)
  Typo in optimize docs (dask#7306)
  dask.graph_manipulation support for xarray.Dataset (dask#7276)
  Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  ...

In dask#7325 one of the assertions was left incorrect, and because Dask didn't have gpuCI back then it wasn't caught at the time. This restores the cuDF assertion to what it was before dask#7305, as cuDF still has to use "linear" instead of "nearest".

Use nearest interpolation for integer input

3a2094a

kylebarron changed the title ~~Use nearest interpolation for integer input~~ Use nearest interpolation for creating percentiles of integer input Mar 2, 2021

kylebarron added 2 commits March 2, 2021 16:51

Fix test

1db8b36

Add test

de953a4

kylebarron marked this pull request as ready for review March 3, 2021 01:59

kylebarron commented Mar 3, 2021

eriknw approved these changes Mar 3, 2021

jsignell merged commit c927731 into dask:master Mar 4, 2021

jakirkham mentioned this pull request Mar 5, 2021

Release 2021.03.0 dask/community#129

Closed

jakirkham mentioned this pull request Mar 5, 2021

Test broadcast merge in local_cudf_merge benchmark rapidsai/dask-cuda#507

Merged

pentschev mentioned this pull request Mar 5, 2021

Fix percentiles_summary with dask_cudf #7325

Merged

3 tasks

dependabot bot mentioned this pull request Mar 8, 2021

Update dask requirement from ^2020.12.0 to ^2021.3.0 Kwonil-Kim/kkpy#8

Closed

dependabot bot mentioned this pull request Mar 12, 2021

Bump dask[bag] from 2.24.0 to 2021.3.0 admdev8/pyglotaran#52

Closed

6 participants
