Skip to content

NEP-35 duck array update#7321

Merged
jrbourbeau merged 5 commits intodask:masterfrom
jrbourbeau:nep-35-fixup
Mar 5, 2021
Merged

NEP-35 duck array update#7321
jrbourbeau merged 5 commits intodask:masterfrom
jrbourbeau:nep-35-fixup

Conversation

@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Mar 4, 2021

This tests out adding an __array_function__ condition to our is_arraylike check (xref #7263 (comment)). Let's see if CI passes

cc @pentschev @jakirkham

@jakirkham
Copy link
Member

cc @shoyer

@jrbourbeau
Copy link
Member Author

Just to summarize the test failures we're seeing:

  • The mindeps build is failing because we support versions of NumPy that predate the addition of __array_function__ which causes is_arraylike to return False on things like NumPy arrays
  • There is one test, dask/dataframe/tests/test_dataframe.py::test_dask_dataframe_holds_scipy_sparse_containers which is failing because scipy.sparse arrays don't implement __array_function__. FWIW I tracked this test back to Test that dask collections can hold scipy.sparse arrays #3738 where it looks like we don't fully support scipy.sparse arrays inside Dask arrays, but it is useful to partially support for some Dask-ML work (hand chunks off to dask.delayed or da.map_blocks workflows)

@jakirkham
Copy link
Member

We also support CuPy sparse matrices for similar reasons, which likely have similar issues

@pentschev
Copy link
Member

Thanks @jrbourbeau , you beat me to it on working this PR. 😄

  • The mindeps build is failing because we support versions of NumPy that predate the addition of __array_function__ which causes is_arraylike to return False on things like NumPy arrays

For this, we'll also want to check for

IS_NEP18_ACTIVE = _is_nep18_active()
.

Seems like we'll need to special-case this as well in is_arraylike, it doesn't look like we have a much better alternative.

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions for CuPy sparse matrices

"""
from .base import is_dask_collection

is_duck_array = hasattr(x, "__array_function__") or hasattr(x, "__array_ufunc__")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pentschev do you have thoughts on this formulation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref-ing @pentschev's earlier comment #7321 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The duck arrays I'm aware of (i.e. NumPy, sparse, cupy, pint, unyt) all will result in is_duck_array being True. That said, I can't be 100% certain that this covers all array-like libraries that are used with Dask today. Perhaps this is a good opportunity for us to specify the conditions for which an array library will work with Dask Array. Xarray has a nice documentation section on this http://xarray.pydata.org/en/stable/internals.html#internals-duck-arrays topic

@jrbourbeau jrbourbeau changed the title [WIP] NEP-35 update NEP-35 duck array update Mar 5, 2021
@jrbourbeau
Copy link
Member Author

If there are no further comments I'll propose we include this in the release tomorrow. @jakirkham @pentschev could I ask one of you to run the cupy tests with the changes here to ensure they pass?

@pentschev
Copy link
Member

If there are no further comments I'll propose we include this in the release tomorrow. @jakirkham @pentschev could I ask one of you to run the cupy tests with the changes here to ensure they pass?

I agree with changes here and tested it locally, the tests pass with the exception of those I mentioned in #7324 . I also applied the patch to a start just after #6738 and by doing so all tests would pass, so from the CuPy side I thing we're good.

Thanks @jrbourbeau for working on this! 😄

@shoyer
Copy link
Member

shoyer commented Mar 5, 2021

Thanks for working on this -- yes, this is important to keep Xarray working with dask :)

My only thought is that the name is_arraylike() is not quite appropriate, given that NumPy uses "array like" for a different meaning. I would suggest using the name is_duck_array() instead.

@pentschev
Copy link
Member

My only thought is that the name is_arraylike() is not quite appropriate, given that NumPy uses "array like" for a different meaning. I would suggest using the name is_duck_array() instead.

I do not oppose doing that, and semantically agree with it. I'm only wondering if renaming that may cause confusion with people, we also have other utility function such as is_dataframe_like and is_series_like with similar meaning, but perhaps renaming and adding that information to docstrings would suffice?

@jakirkham
Copy link
Member

Given this is a bug fix we are trying to get in before a release and I don’t think this function is intended for usage outside Dask, maybe we can punt on function renaming and track that in a new issue?

@jrbourbeau
Copy link
Member Author

maybe we can punt on function renaming and track that in a new issue?

+1. @shoyer thanks for bringing this terminology issue up -- I agree there's a mismatch between the is_arraylike method and the term "array-like". Let's discuss in a follow-up issue

@jrbourbeau
Copy link
Member Author

Thanks all for reviewing! Your feedback was very useful

@jrbourbeau jrbourbeau merged commit fa6203c into dask:master Mar 5, 2021
@jrbourbeau jrbourbeau deleted the nep-35-fixup branch March 5, 2021 16:48
@jakirkham
Copy link
Member

Thanks James! 😄

@shoyer
Copy link
Member

shoyer commented Mar 5, 2021 via email

@jakirkham
Copy link
Member

Filed issue ( #7327 ) on that point. Also noted updating the docstring. Please feel free to add anything else there 🙂

dcherian added a commit to dcherian/dask that referenced this pull request Mar 8, 2021
* upstream/master: (43 commits)
  bump version to 2021.03.0
  Bump minimum version of distributed (dask#7328)
  Fix `percentiles_summary` with `dask_cudf` (dask#7325)
  Temporarily revert recent Array.__setitem__ updates (dask#7326)
  Blockwise.clone (dask#7312)
  NEP-35 duck array update (dask#7321)
  Don't allow setting `.name` for array (dask#7222)
  Use nearest interpolation for creating percentiles of integer input (dask#7305)
  Test `exp` with CuPy arrays (dask#7322)
  Check that computed chunks have right size and dtype (dask#7277)
  pytest.mark.flaky (dask#7319)
  Contributing docs: add note to pull the latest git tags before pip installing Dask (dask#7308)
  Support for Python 3.9 (dask#7289)
  Add broadcast-based merge implementation (dask#7143)
  Add split_every to graph_manipulation (dask#7282)
  Typo in optimize docs (dask#7306)
  dask.graph_manipulation support for xarray.Dataset (dask#7276)
  Add plot width and height support for Bokeh 2.3.0 (dask#7297)
  Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from (dask#6997)
  Remove "cleanup" task in dataframe on-disk shuffle. The partd directory (dask#7260)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants