Add std() support for datetime64 dtype for Pandas-like objects (#8214)#8523
Conversation
Can one of the admins verify this patch?
Just checking in on the review here. Did you have a chance to look at it @ian-r-rose?
ian-r-rose left a comment:
Thanks for working on this @bglossner, this is a nice piece of work!
dask/dataframe/core.py (outdated):

    v = numeric_dd.var(skipna=skipna, ddof=ddof, split_every=split_every)
    name = self._token_prefix + "std"

    def sqrt_and_convert(p):
Here and a couple of places above: it's my understanding that core Dask generally avoids using closures and lambdas for performance reasons. Could we move this into a module-private function like _sqrt_and_convert_to_timedelta? Variables in the closure like is_df_like would probably need to be args in that case.
dask/dataframe/core.py (outdated):

    from .io import from_pandas
    from .numeric import to_numeric

    def convert_to_numeric(s):
I see that you at some point experimented with using view("i8"), but couldn't because it was not implemented yet. There is currently an open PR in #8533 adding it. Would that help to simplify things? It might also improve performance, since it would locally be a copy-free operation.
Unfortunately, it seems that view has the same numeric value for NaT, and skipping nans is a feature of TimeDeltaArray.
The performance improvement would at least be nice but I agree in that it doesn't solve the issue that would have been helpful here. I am fine with waiting until that PR is merged and refactoring after that to use view, since it won't be much of a change!
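A quick illustration of the NaT issue raised above, in plain NumPy (not dask code): reinterpreting datetime64 data as int64 maps NaT to the int64 minimum sentinel, not to NaN, so a view alone cannot provide skipna semantics and missing values still need explicit masking.

```python
import numpy as np

arr = np.array(["2000-01-01", "NaT"], dtype="datetime64[ns]")
as_int = arr.view("i8")  # zero-copy reinterpretation of the int64 storage
# NaT reinterprets to the int64 minimum rather than NaN, so a plain var()
# over as_int would silently include it; mask missing values explicitly.
mask = np.isnat(arr)
```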
dask/dataframe/core.py (outdated):

    _raise_if_object_series(self, "std")

    meta = self._meta_nonempty.std(axis=axis, skipna=skipna)
    is_df_like = is_dataframe_like(self._meta_nonempty)
Any reason to use _meta_nonempty instead of _meta here?
Nope, I didn't know it made a difference. Is there a reason to use one over the other?
dask/dataframe/core.py (outdated):

    return to_numeric(s).mask(s.isnull(), np.nan)

    if is_df_like:
        time_cols = self._meta_nonempty.select_dtypes(
As above, can you do this with self._meta?
For this one, I just assumed _meta_nonempty to be the safer choice, since the columns in it are guaranteed to show up in the result. However, it should never make a difference: datetime is supported in general now, so _meta and _meta_nonempty should both include datetime columns. So I guess my question is the same as above: why use one over the other?
dask/dataframe/core.py (outdated):

    # If there are different types, just convert everything to NaNs for the time columns
    if axis == 1 and len(time_cols) != len(self.columns):
        # This is faster than converting each column to numeric when it's not necessary
        numeric_dd[time_cols.tolist()] = from_pandas(
I don't think you need the tolist(); I think you can index with time_cols directly.
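For reference, a minimal pandas example of the point being made: the Index returned by `select_dtypes(...).columns` works directly as a column indexer, so converting it with `.tolist()` is unnecessary.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2],
        "dt1": pd.to_datetime(["2000-01-01", "2000-01-02"]),
        "dt2": pd.to_datetime(["2001-01-01", "2001-01-02"]),
    }
)
time_cols = df.select_dtypes(include=["datetime64[ns]"]).columns
sub = df[time_cols]  # indexing with the pandas Index directly, no .tolist()
```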
    # Single column always results in NaT
    assert_eq(ddf[["dt1"]].std(axis=1), pdf[["dt1"]].std(axis=1))

    # Mix of datetimes with other numeric types produces NaNs
This would be a useful comment in the implementation; I was confused at first about why you were creating a manual dataframe.
    assert_eq(ddf2.std(axis=1, skipna=True), pdf2.std(axis=1, skipna=True))
    assert_eq(ddf2.std(axis=1, skipna=False), pdf2.std(axis=1, skipna=False))
I might suggest using pytest.parametrize over axis and skipna in these tests rather than repeating variations on assert_eq
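The suggested parametrization could look roughly like this (the test name and body are placeholders, not the actual test under review): each axis/skipna combination becomes its own test case instead of repeated assert_eq calls.

```python
import pytest

# Placeholder sketch of parametrizing over axis and skipna; the body would
# build the dataframes and call assert_eq once per generated case.
@pytest.mark.parametrize("axis", [0, 1])
@pytest.mark.parametrize("skipna", [True, False])
def test_datetime_std(axis, skipna):
    # assert_eq(ddf2.std(axis=axis, skipna=skipna),
    #           pdf2.std(axis=axis, skipna=skipna))
    pass
```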
dask/dataframe/core.py (outdated):

    # Convert timedelta and datetime columns to integer types so we can use var
    for col in time_cols:
        numeric_dd[col] = convert_to_numeric(numeric_dd[col])
    else:
An aside, and others might disagree with me here: I find this deeply nested branching kind of tough to read. Specifically: what "if" block is this attached to? Obviously I can scroll up and down, and some editor support can help here, but to me it might help with legibility to add some small comments like:

    else:  # not is_df_like
Also, sorry for the delay on the review!
Force-pushed 12ed7c0 to 7366e53
Force-pushed 7366e53 to b94fe1d
Is there a good way to rerun the tests without another commit (although I don't think any of the tests are failing because of my change)? Also, before this gets merged, do I need to rebase everything into a single commit on a different branch? Let me know what you think of the new changes, thanks!
jsignell left a comment:
I left a few small comments, but to answer your question: no, you don't have to merge main or tidy up your commits; we will squash-merge when this is done, so it all ends up as one commit anyway. If you want to trigger CI you can just push an empty commit, but our tests have been having some chronic issues (particularly Windows 3.7), so as long as the PR doesn't introduce new failures, we can still merge it.
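The "empty commit" trick mentioned above looks like this (a generic git sketch, shown here in a throwaway repo; on the PR branch you would only run the commit and push):

```shell
# Demonstration in a throwaway repo; an empty commit re-triggers CI
# without changing any files.
cd "$(mktemp -d)" && git init -q .
git -c user.email=you@example.com -c user.name=you \
    commit --allow-empty -m "Empty commit to trigger CI"
# git push   # then push to the PR branch to kick off CI
```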
dask/dataframe/core.py (outdated):

    # Convert timedelta and datetime columns to integer types so we can use var
    for col in time_cols:
        numeric_dd[col] = convert_to_numeric(numeric_dd[col])
    else:
I'm wondering if this might actually be more legible if the conditionals were collapsed, so it would be:

    if PANDAS_GT_120 and is_df_like:
        ...
    elif PANDAS_GT_120 and not is_df_like:
        ...
dask/dataframe/core.py (outdated):

    )

    if is_df_like:
        result = result.astype(meta.dtype)
This feels a little strange to me. Is meta guaranteed to have a dtype?
Not sure :/. Since this function is @derived_from(pd.DataFrame), is the dtype from it not included? meta is created from _meta_nonempty.std(), which I think outputs a Series? I am honestly still not sure how to get the resulting dtypes to match other than to do this.
One thing I can do is at least check that meta has the dtype attribute before doing this.
This has been included in the new commit!
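To make the discussion concrete: `DataFrame.std()` along axis 0 reduces to a pandas Series, which does carry a single `.dtype`, so the guard discussed above can be a plain `hasattr` check. A minimal sketch in plain pandas (not dask's actual code):

```python
import pandas as pd

meta = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}).std()  # -> Series
# A Series meta has a single .dtype; guard with hasattr in case a scalar
# or DataFrame comes back from a different code path.
dtype = meta.dtype if hasattr(meta, "dtype") else None
```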
Hmm, I just tried this out locally and I ran into an issue. I suspect it is because of how an Index is handled:

    import dask
    ddf = dask.datasets.timeseries(freq="1H")
    ddf.index.std()

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    Input In [5], in <module>
    ----> 1 ddf.index.std()
    File ~/dask/dask/dataframe/core.py:96, in _numeric_only.<locals>.wrapper(self, *args, **kwargs)
         94 elif kwargs.get("numeric_only") is True:
         95     self = self._get_numeric_data()
    ---> 96 return func(self, *args, **kwargs)
    File ~/dask/dask/dataframe/core.py:2214, in _Frame.std(self, axis, skipna, ddof, split_every, dtype, out, numeric_only)
       2211     return handle_out(out, result)
       2213 # Case where axis=0 or axis=None
    -> 2214 v = numeric_dd.var(skipna=skipna, ddof=ddof, split_every=split_every)
       2215 name = self._token_prefix + "std"
       2217 if needs_time_conversion:

    TypeError: var() got an unexpected keyword argument 'skipna'
Oof. Testing this out and it's getting messy. An Index is not series-like or dataframe-like (at least, those checks return False for it).

My thinking here is that we can either convert array-like things to a series inside the conversion to numeric and run through the computation, since it will output the same thing, or we can skip that conversion path for an Index.

One other thing to consider here is that perhaps we should just treat index.std() as a call to dask array std() instead. Because dask array std() still doesn't support datetimes, another task would be needed to add support there, where this would be done. I am not sure if this would involve changing the derived_from, adding another decorator, or just simply converting it to a Dask Array and calling std() on it. Thoughts?
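The first option described above (routing an Index through a Series) can be sketched in plain pandas; dask's eventual fix may differ:

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=3, freq="D")
# Wrapping the Index in a Series gives it the Series reduction machinery
# (including skipna handling) while producing the same std() result.
s = pd.Series(idx)
result = s.std()
```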
I actually think it might be a bug in pandas that makes view return an array rather than an index object. But I tried changing from [...]
Sorry, not sure if this is waiting on changes from me or another reviewer based on the new label. I pushed some changes for the NotImplemented issue before those labels were added.
Thanks for the ping @bglossner! This is just waiting on me or @ian-r-rose to give it a last look over. I am expecting it to get in this week.
ok to test
jsignell left a comment:
This looks good! I just have one idea about how to tidy up the pre-existing test. Let me know if you want me to just push to your branch @bglossner
Thanks @bglossner - this is in!!
Adds support for calling std() on Dask Series and DataFrames when they include datetime64 dtypes.

- Closes "Add std() for the datetime64[ns] dtype" #8214
- Passes pre-commit run --all-files