Add `datetime_is_numeric` to dataframe.describe by jsignell · Pull Request #7719 · dask/dask

jsignell · 2021-05-27T16:02:35Z

Start working on adding datetime_is_numeric to the dataframe.describe method.

Part of #7100
Closes #6434

Update: this is getting closer. The last remaining issue is that the means of np.datetime64 columns are not correct.

My current thinking is that the empty rows are somehow being counted.

jsignell · 2021-05-27T21:18:52Z

So I've narrowed the issues with this down to summing up the integers representing dates:

import numpy as np
import pandas as pd

s = pd.Series([np.datetime64("2000-01-01"), np.datetime64("2010-01-01"), np.datetime64("2010-01-01")] * 3)
s.mean()
# Timestamp('2006-09-01 08:00:00')

# calculating the mean of the integer representation and converting back to datetime works fine
pd.to_datetime(pd.to_numeric(s).mean())
# Timestamp('2006-09-01 08:00:00')

# calculating the sum of the integer representation and then dividing by the count
# (which is how dask implements means) does not work
pd.to_datetime(pd.to_numeric(s).sum() / s.count())
# Timestamp('1941-09-19 16:02:49.587827584')
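One way to see why the sum-then-divide path goes wrong is to compare the int64 sum against an exact sum of Python integers. The sketch below is illustrative only and is not code from this PR; it rebuilds the same nine dates with plain NumPy (an assumption made to keep it independent of pandas), and the variable names are made up for the example.

```python
import numpy as np

# The same nine dates as in the example above, as nanoseconds since the epoch.
dates = np.array(["2000-01-01", "2010-01-01", "2010-01-01"] * 3, dtype="datetime64[ns]")
ns = dates.astype("int64")

# Exact sum using Python's arbitrary-precision integers.
exact_total = sum(int(v) for v in ns)

# Largest value an int64 can hold.
int64_max = int(np.iinfo(np.int64).max)
overflows = exact_total > int64_max

# NumPy's int64 reduction wraps around silently instead of raising.
wrapped_total = int(ns.sum())

# Dividing the wrapped sum by the count lands decades before the real dates.
bogus_mean = np.datetime64(wrapped_total // len(ns), "ns")

print(overflows)                               # True
print(wrapped_total == exact_total - 2**64)    # True
print(bogus_mean)
```

The exact total is roughly 1.04e19, which is above the int64 maximum of about 9.22e18, so the wrapped sum is negative and the resulting "mean" falls in 1941.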

jsignell · 2021-05-27T21:19:20Z

@jorisvandenbossche I'm kind of stuck if you have any pointers

jsignell · 2021-06-02T13:19:28Z

Actually, I think I can calculate the distributed mean by getting the mean and the count of each chunk, multiplying each chunk mean by that chunk's count divided by the sum of the counts, and then adding up the results. I am hoping that will resolve my large integer issue.
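A minimal sketch of that idea, assuming the chunks are plain NumPy arrays of int64 nanoseconds. `combine_chunk_means` is a hypothetical helper written for this illustration, not a dask function, and this is not what the PR ended up implementing.

```python
import numpy as np


def combine_chunk_means(chunk_means, chunk_counts):
    """Combine per-chunk means into one mean using weights count_i / total_count.

    Hypothetical helper for illustration; not part of dask.
    """
    means = np.asarray(chunk_means, dtype="float64")
    counts = np.asarray(chunk_counts, dtype="float64")
    weights = counts / counts.sum()
    return float((means * weights).sum())


# Same nine dates as in the earlier example, as int64 nanoseconds.
dates = np.array(["2000-01-01", "2010-01-01", "2010-01-01"] * 3, dtype="datetime64[ns]")
ns = dates.astype("int64")

# Pretend the data lives in two unevenly sized chunks.
chunks = [ns[:4], ns[4:]]
chunk_means = [chunk.mean() for chunk in chunks]  # np.mean accumulates integer input in float64
chunk_counts = [len(chunk) for chunk in chunks]

combined = combine_chunk_means(chunk_means, chunk_counts)
direct = float(ns.mean())

print(np.datetime64(int(combined), "ns"))
```

Because every intermediate value stays on the order of a single timestamp, nothing approaches the int64 limit. The trade-off is that float64 cannot represent nanosecond timestamps exactly, so the result may be off by a few hundred nanoseconds.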

jsignell · 2021-09-15T14:53:26Z

OK, I finally decided to punt on the mean of datetime columns and leave it as NaN for now.

In [1]: import dask
   ...: 
   ...: dask.datasets.timeseries().reset_index(drop=False).describe(datetime_is_numeric=True).compute()
Out[1]: 
                        timestamp            id             x             y
count                     2592000  2.592000e+06  2.592000e+06  2.592000e+06
min           2000-01-01 00:00:00  8.250000e+02 -9.999989e-01 -9.999995e-01
25%    2000-01-08 11:59:59.500000  9.790000e+02 -4.930464e-01 -4.946528e-01
50%           2000-01-15 23:59:59  1.000000e+03  7.424214e-03  7.274488e-03
75%    2000-01-23 11:59:59.500000  1.022000e+03  5.052252e-01  5.043816e-01
max           2000-01-30 23:59:59  1.163000e+03  9.999999e-01  9.999992e-01
mean                          NaN  1.000008e+03 -8.756789e-05  5.371854e-05
std                           NaN  3.164682e+01  5.773147e-01  5.772331e-01

jsignell · 2021-09-15T15:46:59Z

I'm planning on merging this this week unless there are comments.

jrbourbeau · 2021-09-15T16:36:42Z

rerun tests

jrbourbeau

Thanks @jsignell -- just a couple of minor comments

dask/dataframe/core.py

dask/dataframe/tests/test_dataframe.py

jrbourbeau · 2021-09-16T14:35:38Z

rerun tests

github-actions bot added the dataframe label May 27, 2021

jsignell mentioned this pull request May 27, 2021

Follow-up compatibility issues with pandas #7100

Closed

7 tasks

jsignell changed the title ~~Start on dataframe describe datetime_is_numeric~~ [WIP] Add datetime_is_numeric to dataframe.describe May 27, 2021

jsignell force-pushed the pandas-describe branch from 641649a to 54e7569 Compare May 27, 2021 20:36

jsignell closed this May 27, 2021

jsignell reopened this May 27, 2021

jsignell mentioned this pull request Jun 18, 2021

WIP: update dask dataframe describe method #7290

Closed

3 tasks

jsignell mentioned this pull request Jul 28, 2021

Mean implementation for datetime series #5794

Open

2 tasks

jsignell added 4 commits September 15, 2021 10:00

Start on dataframe describe datetime_is_numeric

f7cc8ef

Change test to use datetime_is_numeric

f537513

Make series behave as expected, check describe with no args

e6001d0

Don't include mean for datetimes

f300cfc

jsignell force-pushed the pandas-describe branch from 54e7569 to f300cfc Compare September 15, 2021 14:48

jsignell marked this pull request as ready for review September 15, 2021 14:53

Make sure that it works with pandas < 1.1.0

6477de2

jrbourbeau reviewed Sep 15, 2021

dask/dataframe/core.py (outdated)

dask/dataframe/core.py

dask/dataframe/tests/test_dataframe.py

Respond to comments

dced508

jsignell changed the title ~~[WIP] Add datetime_is_numeric to dataframe.describe~~ Add datetime_is_numeric to dataframe.describe Sep 15, 2021

jsignell mentioned this pull request Sep 16, 2021

Pandas & Dask mean/std result disperancy for timedelta columns #6811

Open

jrbourbeau mentioned this pull request Sep 16, 2021

git clone issue in gpuCI build #8154

Closed

jsignell merged commit 0530bdb into dask:main Sep 16, 2021

jsignell deleted the pandas-describe branch September 16, 2021 15:29

jsignell merged 6 commits into dask:main from jsignell:pandas-describe