Use nearest interpolation for creating percentiles of integer input#7305
Use nearest interpolation for creating percentiles of integer input#7305jsignell merged 3 commits intodask:masterfrom
Conversation
|
I installed this branch with Edit: Seems to be working with distributed now as well. |
|
Thanks @kylebarron, this looks great to me! |
| d1 = d.set_index("x", npartitions=3) | ||
| assert d1.npartitions == 3 | ||
| assert set(d1.divisions) == set([1, 2, 3, 4]) | ||
| assert set(d1.divisions) == set([1, 2, 4]) |
There was a problem hiding this comment.
Nope. This test is contrived and the behavior doesn't reflect a real concern.
I think it would be very hard to contrive a pathological example where this change would make a noticeable impact (and if I'm wrong about this, then I would be very interested in learning where it does make an impact!). The likely worst case scenario is other projects will need to update a couple tests that depended on the old behavior (but even this may be a stretch).
Everything LGTM.
I currently don't have write permission to this repo, so somebody else will need to merge.
|
Thanks you @kylebarron for opening this and @eriknw for reviewing! |
|
This PR seems to be causing CI failures in dask-cuda cc @jakirkham (no failures before this commit) |
cc @jrbourbeau (for vis) |
|
Actually - It looks like this PR is completely breaking dask_cudf, because the "nearest" option is not supported for cupy percentile. I don't think we can choose "nearest" as a default here. |
|
Thanks Rick! Also mentioned this over on the release issue ( dask/community#129 ). Would changing the default be sufficient for addressing the issue? |
It seems that the whole point of this PR was to change the default for integer types to be “nearest” (to fix some correctness issues). I’ll need to investigate tomorrow morning if we can fix the original issue without breaking cupy. |
|
I'm attempting to fix the issue mentioned above in #7325 . |
|
Oof sorry everyone! It didn't occur to me to check with cupy folks. |
Yeah, we need better testing in Dask. I think @quasiben is actively looking for HW so that CuPy tests can run during PRs as too, although I think this wouldn't have been catched anyway, as I don't think there were tests covering this. 🙂 |
* upstream/master: (43 commits) bump version to 2021.03.0 Bump minimum version of distributed (dask#7328) Fix `percentiles_summary` with `dask_cudf` (dask#7325) Temporarily revert recent Array.__setitem__ updates (dask#7326) Blockwise.clone (dask#7312) NEP-35 duck array update (dask#7321) Don't allow setting `.name` for array (dask#7222) Use nearest interpolation for creating percentiles of integer input (dask#7305) Test `exp` with CuPy arrays (dask#7322) Check that computed chunks have right size and dtype (dask#7277) pytest.mark.flaky (dask#7319) Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308) Support for Python 3.9 (dask#7289) Add broadcast-based merge implementation (dask#7143) Add split_every to graph_manipulation (dask#7282) Typo in optimize docs (dask#7306) dask.graph_manipulation support for xarray.Dataset (dask#7276) Add plot width and height support for Bokeh 2.3.0 (dask#7297) Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997) Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260) ...
partition_quantilesfinds incorrect minimum with large unsigned integers #7304black dask/flake8 dask