Skip to content

ENH: da.ufuncs now supports DataFrame/Series#1669

Merged
mrocklin merged 8 commits intodask:masterfrom
sinhrks:dask_ufunc
Nov 3, 2016
Merged

ENH: da.ufuncs now supports DataFrame/Series#1669
mrocklin merged 8 commits intodask:masterfrom
sinhrks:dask_ufunc

Conversation

@sinhrks
Copy link
Member

@sinhrks sinhrks commented Oct 16, 2016

On current master, da.ufuncs doesn't work with dask.dataframe instances.

On current master:

import dask.array as da
import dask.dataframe as dd

s = pd.Series(np.random.randn(100))
ds = dd.from_pandas(s, 3)
da.sin(ds)
# ValueError: max() arg is an empty sequence

After the PR:

Now these return dask instances.

da.sin(ds)
# dd.Series<sin-cf1..., npartitions=3, divisions=(0, 34, 68, 99)>

da.sin(ds).compute()
#0    -0.329255
#1    -0.586025
#         ...   
#98    0.739737
#99    0.567551
# dtype: float64

To make impl simple, I'm adding _elemwise property to Array/_Frame which returns function used for element-wise op.

  • Add __array__ interface to DataFrame
    • __array_wrap__ with non-default Index
  • Tests
    • ufunc which returns array
    • ufunc with 2 arrays

def wrapped(x, *args, **kwargs):
if hasattr(x, '_elemwise_'):
return x._elemwise_(numpy_ufunc, x, *args, **kwargs)
# return x._elemwise_(numpy_ufunc, x, *args, **kwargs,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current test all succeeded even if it ignores wrap_elemwise args (like logical_and = wrap_elemwise(np.logical_and, dtype='bool')). Can I remove it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, perhaps we've gotten better at determining dtypes since when this code was originally written.

@mrocklin
Copy link
Member

cc @shoyer this seems like something you might find interesting

@mrocklin
Copy link
Member

This all looks pretty good to me. +1

@sinhrks sinhrks force-pushed the dask_ufunc branch 2 times, most recently from 73799c4 to dc93c1c Compare October 19, 2016 21:38
@mrocklin
Copy link
Member

This has fallen behind master. I would like to make sure that this gets in, even if it does not support all of the ufuncs.

@sinhrks if I were to rebase and merge this in is there anything that would be seriously lacking? While I would like to have frexp and modf I don't think that they should stop us from merging in.

@sinhrks sinhrks force-pushed the dask_ufunc branch 3 times, most recently from 6e1e617 to 24e320b Compare October 29, 2016 03:47
@sinhrks sinhrks changed the title (WIP)ENH: da.ufuncs now supports DataFrame/Series ENH: da.ufuncs now supports DataFrame/Series Oct 29, 2016
@sinhrks
Copy link
Member Author

sinhrks commented Oct 29, 2016

@mrocklin OK. Split frexp and modf for DataFrame to #1723, which may needs little internal cleanup.

Now this PR is ready for review.

@mrocklin
Copy link
Member

Two questions:

  1. How do we get the dtype of the output series? Does some small computation happen? Are we using ._meta?
  2. What type does da.ufunc(dd.Index) return?

@sinhrks
Copy link
Member Author

sinhrks commented Oct 29, 2016

  1. dtype is inferred via _meta, like
np.sin(pd.Series([], dtype=np.int64))
# Series([], dtype: float64)
  1. fixed to return dd.Index if possible. Can't work if result has bool dtype, because pd.Index returns np.ndarray rather than pd.Index (will fix in da.ufunc DataFrame support follow-up #1723)

@mrocklin
Copy link
Member

Merging soon if no further comments

# We don't inspect the values of 0d dask arrays, because these could
# hold potentially very expensive calculations.
vals = [np.empty((1,) * a.ndim, dtype=a.dtype)
vals = [np.empty((1, ) * a.ndim, dtype=a.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think adding this extra white-space is an improvement

return x.__array_wrap__(numpy_ufunc(x, *args, **kwargs))


def wrap_elemwise(numpy_ufunc, array_wrap=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will pickle properly, which will likely be an annoyance later (see pydata/xarray#901).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the soln should be, make a callable class which wraps ufuncs like methodcaller?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do a fair amount of this. I agree that it's not optimal but at least these are dask.array functions, rather than functions that we're trying to send over the network. Actually, I appreciate that wrapped calls Array._elemwise with the only well defined easily pickleable functions (__array_wrap__ and the provided numpy_ufunc).

I think that it's nice to avoid dynamic functions when convenient but that, in this case, it seems like the right choice. Other alternatives seem more complex. If we really wanted to ensure picklability across the project then we might consider something like toolz.curry.

Also, for what it's worth, I've started to think of cloudpickle as the default serialization solution for Python functions.

def wrapped(x, *args, **kwargs):
if hasattr(x, '_elemwise'):
if array_wrap:
return x._elemwise(__array_wrap__, numpy_ufunc,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is __array_wrap__ defined for dask.array? Why exactly do you need this, only for some ufuncs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for letting some ufuncs which not calls __array_wraps__ handles DataFrame.

np.isreal(pd.Series([1, 2, 3]))
# array([ True,  True,  True], dtype=bool)

@mrocklin
Copy link
Member

mrocklin commented Nov 2, 2016

I'm going to try to release in the next two days. I may take on this PR tomorrow if you don't mind @sinhrks .

@sinhrks
Copy link
Member Author

sinhrks commented Nov 3, 2016

@mrocklin Thx for the support. Pls fix it if you already have an idea in your mind:)

@mrocklin
Copy link
Member

mrocklin commented Nov 3, 2016

To me it looks like all raised issues are handled. The serialization issue is long-standing and something we would need to fix throughout the codebase in many places. The __array_wrap__ question seems to have a reasonable answer. I'd like to merge this. I'll wait a few more hours to make sure that @shoyer 's time zone has woken up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants