Skip to content

Use nearest interpolation for creating percentiles of integer input#7305

Merged
jsignell merged 3 commits intodask:masterfrom
kylebarron:kyle/integer-percentile
Mar 4, 2021
Merged

Use nearest interpolation for creating percentiles of integer input#7305
jsignell merged 3 commits intodask:masterfrom
kylebarron:kyle/integer-percentile

Conversation

@kylebarron
Copy link
Contributor

@kylebarron kylebarron commented Mar 2, 2021

@kylebarron kylebarron changed the title Use nearest interpolation for integer input Use nearest interpolation for creating percentiles of integer input Mar 2, 2021
@kylebarron
Copy link
Contributor Author

kylebarron commented Mar 2, 2021

I installed this branch with pip install -e ., and this change seems to fix my issue from #7304 with the synchronous scheduler, but I don't see a change when using the distributed scheduler, even when installing that from source, but I'm guessing I did something wrong setting that up. (I used the IPython instructions)

Edit: Seems to be working with distributed now as well.

@kylebarron kylebarron marked this pull request as ready for review March 3, 2021 01:59
@eriknw
Copy link
Member

eriknw commented Mar 3, 2021

Thanks @kylebarron, this looks great to me!

d1 = d.set_index("x", npartitions=3)
assert d1.npartitions == 3
assert set(d1.divisions) == set([1, 2, 3, 4])
assert set(d1.divisions) == set([1, 2, 4])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eriknw do you think this is an issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. This test is contrived and the behavior doesn't reflect a real concern.

I think it would be very hard to contrive a pathological example where this change would make a noticeable impact (and if I'm wrong about this, then I would be very interested in learning where it does make an impact!). The likely worst case scenario is other projects will need to update a couple tests that depended on the old behavior (but even this may be a stretch).

Everything LGTM.

I currently don't have write permission to this repo, so somebody else will need to merge.

@jsignell
Copy link
Member

jsignell commented Mar 4, 2021

Thanks you @kylebarron for opening this and @eriknw for reviewing!

@jsignell jsignell merged commit c927731 into dask:master Mar 4, 2021
@rjzamora
Copy link
Member

rjzamora commented Mar 5, 2021

This PR seems to be causing CI failures in dask-cuda cc @jakirkham (no failures before this commit)

@jakirkham
Copy link
Member

This PR seems to be causing CI failures in dask-cuda cc @jakirkham (no failures before this commit)

cc @jrbourbeau (for vis)

@rjzamora
Copy link
Member

rjzamora commented Mar 5, 2021

Actually - It looks like this PR is completely breaking dask_cudf, because the "nearest" option is not supported for cupy percentile. I don't think we can choose "nearest" as a default here.

@jakirkham
Copy link
Member

jakirkham commented Mar 5, 2021

Thanks Rick! Also mentioned this over on the release issue ( dask/community#129 ). Would changing the default be sufficient for addressing the issue?

@rjzamora
Copy link
Member

rjzamora commented Mar 5, 2021

Would changing the default be sufficient for addressing the issue?

It seems that the whole point of this PR was to change the default for integer types to be “nearest” (to fix some correctness issues). I’ll need to investigate tomorrow morning if we can fix the original issue without breaking cupy.

@pentschev
Copy link
Member

I'm attempting to fix the issue mentioned above in #7325 .

@jsignell
Copy link
Member

jsignell commented Mar 5, 2021

Oof sorry everyone! It didn't occur to me to check with cupy folks.

@pentschev
Copy link
Member

Oof sorry everyone! It didn't occur to me to check with cupy folks.

Yeah, we need better testing in Dask. I think @quasiben is actively looking for HW so that CuPy tests can run during PRs as too, although I think this wouldn't have been catched anyway, as I don't think there were tests covering this. 🙂

dcherian added a commit to dcherian/dask that referenced this pull request Mar 8, 2021
* upstream/master: (43 commits)
  bump version to 2021.03.0
  Bump minimum version of distributed (dask#7328)
  Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  Temporarily revert recent Array.__setitem__ updates (dask#7326)
  Blockwise.clone (dask#7312)
  NEP-35 duck array update (dask#7321)
  Don't allow setting `.name` for array (dask#7222)
  Use nearest interpolation for creating percentiles of integer input (dask#7305)
  Test `exp` with CuPy arrays (dask#7322)
  Check that computed chunks have right size and dtype (dask#7277)
  pytest.mark.flaky (dask#7319)
  Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  Support for Python 3.9 (dask#7289)
  Add broadcast-based merge implementation (dask#7143)
  Add split_every to graph_manipulation (dask#7282)
  Typo in optimize docs (dask#7306)
  dask.graph_manipulation support for xarray.Dataset (dask#7276)
  Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  ...
pentschev added a commit to pentschev/dask that referenced this pull request Aug 3, 2021
In dask#7325 one of the assertions was left
incorrect, and because Dask didn't have gpuCI back then it wasn't
catched before. This fixes cuDF assertion back to what it was before
dask#7305, as cuDF still has to use
"linear" instead of "nearest".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

partition_quantiles finds incorrect minimum with large unsigned integers

6 participants