
Add datetime_is_numeric to dataframe.describe #7719

Merged
jsignell merged 6 commits into dask:main from jsignell:pandas-describe
Sep 16, 2021

Conversation

@jsignell
Member

@jsignell jsignell commented May 27, 2021

Start working on adding datetime_is_numeric to the dataframe.describe method.

Part of #7100
Closes #6434

Update: this is getting closer. The last remaining issue is that the means of np.datetime64 columns are not correct.

My current thinking is that the empty rows are somehow being counted.

@jsignell jsignell changed the title Start on dataframe describe datetime_is_numeric [WIP] Add datetime_is_numeric to dataframe.describe May 27, 2021
@jsignell
Member Author

So I've narrowed the issues with this down to summing up the integers representing dates:

import numpy as np
import pandas as pd

s = pd.Series([np.datetime64("2000-01-01"), np.datetime64("2010-01-01"), np.datetime64("2010-01-01")] * 3)
s.mean()
# Timestamp('2006-09-01 08:00:00')

# calculating the mean of the integer representation and converting back to datetime works fine
pd.to_datetime(pd.to_numeric(s).mean())
# Timestamp('2006-09-01 08:00:00')

# calculating the sum of the integer representation and then dividing by count - aka how dask implements means - does not work
pd.to_datetime(pd.to_numeric(s).sum() / s.count())
# Timestamp('1941-09-19 16:02:49.587827584')
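
The 1941 result is consistent with int64 overflow: the nanosecond integers for these nine dates add up to more than 2**63 - 1, so the sum wraps around before the division happens. The sketch below is only an illustration of that arithmetic using plain NumPy (the variable names are made up for this example and are not taken from the dask code); it compares an exact Python-int total with the wrapped int64 total.

```python
import numpy as np

# Same nine dates as above, stored as int64 nanoseconds since the epoch.
dates = np.array(["2000-01-01", "2010-01-01", "2010-01-01"] * 3, dtype="datetime64[ns]")
as_int = dates.astype("int64")

# Python ints are arbitrary precision, so this total is exact.
exact_total = sum(int(v) for v in as_int)

# NumPy sums an int64 array in int64, which wraps silently past 2**63 - 1.
wrapped_total = int(as_int.sum())

int64_max = np.iinfo(np.int64).max
print(exact_total > int64_max)               # the true sum does not fit in int64
print(wrapped_total == exact_total - 2**64)  # the int64 sum wrapped around once

# Divide each total by the count and convert back to a datetime.
n = len(as_int)
print(np.datetime64(exact_total // n, "ns"))    # the expected mean
print(np.datetime64(wrapped_total // n, "ns"))  # lands in 1941, as in the output above
```

`Series.mean()` and `pd.to_numeric(s).mean()` give the expected date here, while the explicit integer sum followed by a division goes through the wrapped total.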

@jsignell
Member Author

jsignell commented May 27, 2021

@jorisvandenbossche I'm kind of stuck here, if you have any pointers.

@jsignell jsignell closed this May 27, 2021
@jsignell jsignell reopened this May 27, 2021
@jsignell
Member Author

jsignell commented Jun 2, 2021

Actually, I think I can calculate the distributed mean by getting the mean and the count of each chunk, weighting each chunk's mean by its count divided by the sum of the counts, and adding up the weighted means. I am hoping that will resolve my large integer issue.
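
A minimal sketch of that idea, assuming each chunk can report its own mean and count. The helper `combine_chunk_means` and the two-way split are hypothetical and only illustrate the arithmetic; this is not how dask implements it, and the PR ultimately left the datetime mean as NaN. Because each chunk mean is scaled by a fraction before being added, no intermediate value gets anywhere near the int64 limit.

```python
import numpy as np


def combine_chunk_means(means, counts):
    """Combine per-chunk means, weighting each by its share of the total count."""
    total = sum(counts)
    return sum(mean * (count / total) for mean, count in zip(means, counts))


# Same nine dates as in the earlier comment, as int64 nanoseconds since the epoch.
dates = np.array(["2000-01-01", "2010-01-01", "2010-01-01"] * 3, dtype="datetime64[ns]")
as_int = dates.astype("int64")

# Hypothetical partitioning into two chunks (sizes 5 and 4).
chunks = np.array_split(as_int, 2)
chunk_means = [chunk.mean() for chunk in chunks]  # NumPy accumulates integer means in float64
chunk_counts = [len(chunk) for chunk in chunks]

combined = combine_chunk_means(chunk_means, chunk_counts)
print(np.datetime64(int(round(combined)), "ns"))
```

One trade-off: the weights and chunk means are float64, which cannot represent every nanosecond timestamp of this size exactly, so the combined value is only approximate at nanosecond resolution.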

@jsignell
Member Author

OK, I finally decided to just punt on the mean of datetime columns and leave it as NaN for now.

In [1]: import dask
   ...: 
   ...: dask.datasets.timeseries().reset_index(drop=False).describe(datetime_is_numeric=True).compute()
Out[1]: 
                        timestamp            id             x             y
count                     2592000  2.592000e+06  2.592000e+06  2.592000e+06
min           2000-01-01 00:00:00  8.250000e+02 -9.999989e-01 -9.999995e-01
25%    2000-01-08 11:59:59.500000  9.790000e+02 -4.930464e-01 -4.946528e-01
50%           2000-01-15 23:59:59  1.000000e+03  7.424214e-03  7.274488e-03
75%    2000-01-23 11:59:59.500000  1.022000e+03  5.052252e-01  5.043816e-01
max           2000-01-30 23:59:59  1.163000e+03  9.999999e-01  9.999992e-01
mean                          NaN  1.000008e+03 -8.756789e-05  5.371854e-05
std                           NaN  3.164682e+01  5.773147e-01  5.772331e-01

@jsignell jsignell marked this pull request as ready for review September 15, 2021 14:53
@jsignell
Member Author

I'm planning on merging this this week unless there are comments.

@jrbourbeau
Member

rerun tests

Member

@jrbourbeau jrbourbeau left a comment

Thanks @jsignell -- just a couple of minor comments

@jsignell jsignell changed the title [WIP] Add datetime_is_numeric to dataframe.describe Add datetime_is_numeric to dataframe.describe Sep 15, 2021
@jrbourbeau
Member

rerun tests

@jsignell jsignell merged commit 0530bdb into dask:main Sep 16, 2021
@jsignell jsignell deleted the pandas-describe branch September 16, 2021 15:29
Development

Successfully merging this pull request may close these issues.

Update dataframe.describe for pandas 1.1

2 participants