Update dataframe.describe for pandas 1.1 #6434

@TomAugspurger

In pandas 1.1, the default handling of datetime columns in describe has been deprecated. Previously they were treated like categoricals (giving things like unique, top, and freq). In the future they'll be treated like numerics (giving things like the mean and quantiles). Dask's describe calls pandas on the metadata, so it emits the same warning:

In [15]: df = pd.DataFrame({"A": pd.date_range("2000", periods=2)})

In [16]: ddf = dd.from_pandas(df, npartitions=1)

In [17]: df.describe()
/Users/taugspurger/.virtualenvs/dask-dev/bin/ipython:1: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
  #!/Users/taugspurger/Envs/dask-dev/bin/python
Out[17]:
                          A
count                     2
unique                    2
top     2000-01-01 00:00:00
freq                      1
first   2000-01-01 00:00:00
last    2000-01-02 00:00:00

In [18]: ddf.describe()
/Users/taugspurger/sandbox/dask/dask/dataframe/core.py:2230: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
  meta = data._meta_nonempty.describe()
/Users/taugspurger/sandbox/dask/dask/dataframe/core.py:2128: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
  meta = self._meta_nonempty.describe(include=include, exclude=exclude)
Out[18]:
Dask DataFrame Structure:
                    A
npartitions=1
               object
                  ...
Dask Name: describe, 19 tasks

In [19]: _.compute()
Out[19]:
                          A
unique                    2
count                     2
top     2000-01-02 00:00:00
freq                      1
first   2000-01-01 00:00:00
last    2000-01-02 00:00:00

To silence this warning, we'll need to pass datetime_is_numeric=True:

In [21]: df.describe(datetime_is_numeric=True)
Out[21]:
                         A
count                    2
mean   2000-01-01 12:00:00
min    2000-01-01 00:00:00
25%    2000-01-01 06:00:00
50%    2000-01-01 12:00:00
75%    2000-01-01 18:00:00
max    2000-01-02 00:00:00

I don't know if we'll want to add that to our API or not.
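
One possible approach, sketched below, is to pass the keyword only when the installed pandas accepts it. This is a minimal illustration, not a proposed dask implementation: the helper name `describe_datetimes_as_numeric` is hypothetical, and it assumes that a pandas whose `describe` lacks the `datetime_is_numeric` parameter already treats datetimes as numeric by default (which would not hold for pandas older than 1.1).

```python
import inspect

import pandas as pd


def describe_datetimes_as_numeric(df):
    """Hypothetical helper: call describe() with datetimes treated as numeric.

    Passes datetime_is_numeric=True only if pandas exposes that parameter,
    so the call does not fail on versions where the keyword is absent.
    """
    params = inspect.signature(pd.DataFrame.describe).parameters
    if "datetime_is_numeric" in params:
        # pandas versions that still have the transitional keyword
        return df.describe(datetime_is_numeric=True)
    # Assumption: without the keyword, numeric treatment is the default
    return df.describe()


df = pd.DataFrame({"A": pd.date_range("2000", periods=2)})
result = describe_datetimes_as_numeric(df)
print(result)
```

Detecting the parameter rather than comparing version strings avoids hard-coding the release in which the keyword appears or disappears.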
