Add datetime_is_numeric to dataframe.describe #7719
Merged
Member
Author
|
So I've narrowed the issues with this down to summing up the integers representing dates:
import numpy as np
import pandas as pd
s = pd.Series([np.datetime64("2000-01-01"),np.datetime64("2010-01-01"),np.datetime64("2010-01-01")]* 3)
s.mean()
# Timestamp('2006-09-01 08:00:00')
# calculating the mean of the integer representation and converting back to datetime works fine
pd.to_datetime(pd.to_numeric(s).mean())
# Timestamp('2006-09-01 08:00:00')
# calculating the sum of the integer representation and then dividing by count - aka how dask implements means - does not work
pd.to_datetime(pd.to_numeric(s).sum() / s.count())
# Timestamp('1941-09-19 16:02:49.587827584') |
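A minimal sketch of why the sum-then-divide path goes wrong, assuming the series is stored as int64 nanoseconds since the epoch: the exact sum of the nine values is larger than the largest int64, so it wraps around before the division happens. The names `exact_sum`, `wrapped` and `wrapped_mean` are only for illustration and are not part of this PR.

```python
import numpy as np
import pandas as pd

# Same values as in the comment above, pinned to nanosecond resolution.
s = pd.Series(
    pd.to_datetime(["2000-01-01", "2010-01-01", "2010-01-01"] * 3)
).astype("datetime64[ns]")

# Integer nanoseconds since the epoch, summed with Python ints
# (arbitrary precision, so this sum cannot overflow).
exact_sum = sum(int(v) for v in s.astype("int64"))
int64_max = np.iinfo(np.int64).max

# The true sum does not fit into a signed 64-bit integer.
print(exact_sum > int64_max)

# Emulate the int64 wraparound by subtracting 2**64, then divide by the count.
wrapped = exact_sum - 2**64
wrapped_mean = pd.Timestamp(wrapped // len(s))
print(wrapped_mean)  # a date in 1941, like the incorrect result above
```

The per-series `mean()` avoids this because it does not accumulate the values in int64.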
Member
Author
|
@jorisvandenbossche I'm kind of stuck if you have any pointers |
Member
Author
|
Actually I think I can calculate the distributed mean by getting the mean and the count of each chunk, multiplying each chunk mean by that chunk's count divided by the sum of the counts, and then adding up the results. I am hoping that will resolve my large integer issue. |
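The idea in the comment above could be sketched roughly as follows. This is not the code in this PR (a later comment says the datetime mean was left as NaN); `combine_chunk_means` is a hypothetical helper, and plain pandas Series stand in for dask partitions. Each chunk mean is expressed as an offset from a reference timestamp so that no large integer sum is ever formed.

```python
import pandas as pd


def combine_chunk_means(chunks):
    """Weighted combination of per-chunk datetime means (hypothetical helper)."""
    # Per-chunk mean and non-null count; chunks without valid values are skipped.
    stats = [(chunk.mean(), int(chunk.count())) for chunk in chunks]
    stats = [(mean, count) for mean, count in stats if count > 0]
    if not stats:
        return pd.NaT

    total = sum(count for _, count in stats)
    ref = stats[0][0]  # reference point, keeps the offsets small

    # Sum of (chunk mean - ref) weighted by count / total.
    offset = pd.Timedelta(0)
    for mean, count in stats:
        offset += (mean - ref) * (count / total)
    return ref + offset


s = pd.Series(
    pd.to_datetime(["2000-01-01", "2010-01-01", "2010-01-01"] * 3)
).astype("datetime64[ns]")

# Unevenly sized chunks, to show that the weighting matters.
chunks = [s.iloc[:2], s.iloc[2:7], s.iloc[7:]]
combined = combine_chunk_means(chunks)
print(combined, s.mean())
```

The weights are floats, so the result can differ from the exact mean by a small rounding error; the sketch only aims to agree with `s.mean()` to well under a millisecond.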
54e7569 to
f300cfc
Compare
Member
Author
|
Ok I finally decided to just punt on the mean of datetime and leave it as nan for now.
In [1]: import dask
   ...:
   ...: dask.datasets.timeseries().reset_index(drop=False).describe(datetime_is_numeric=True).compute()
Out[1]:
                        timestamp             id              x              y
count                     2592000   2.592000e+06   2.592000e+06   2.592000e+06
min           2000-01-01 00:00:00   8.250000e+02  -9.999989e-01  -9.999995e-01
25%    2000-01-08 11:59:59.500000   9.790000e+02  -4.930464e-01  -4.946528e-01
50%           2000-01-15 23:59:59   1.000000e+03   7.424214e-03   7.274488e-03
75%    2000-01-23 11:59:59.500000   1.022000e+03   5.052252e-01   5.043816e-01
max           2000-01-30 23:59:59   1.163000e+03   9.999999e-01   9.999992e-01
mean                          NaN   1.000008e+03  -8.756789e-05   5.371854e-05
std                           NaN   3.164682e+01   5.773147e-01   5.772331e-01 |
Member
Author
|
I'm planning on merging this this week unless there are comments. |
Member
|
rerun tests |
jrbourbeau
reviewed
Sep 15, 2021
Member
jrbourbeau
left a comment
Thanks @jsignell -- just a couple of minor comments
Member
|
rerun tests |
Start working on adding datetime_is_numeric to the dataframe.describe method.
Part of #7100
Closes #6434
Update: this is getting closer. The last remaining issue is that the means of np.datetime64 values are not correct. My current thinking is that somehow the empty rows are being counted