
Test pandas 1.1.x / 1.2.0 releases and pandas nightly#6996

Merged
jsignell merged 44 commits into dask:master from jorisvandenbossche:test-pandas-1.2
Jan 21, 2021

Conversation

@jorisvandenbossche
Member

No description provided.

@TomAugspurger
Member

Thanks for starting this, I've been a bit busy lately :)

@jsignell
Member

I fixed some of the obvious ones. Mind if I push?

@jorisvandenbossche
Member Author

Great, feel free to push!

Some notes about the failures I already investigated up to now:

@jsignell
Member

I was just reading the FutureWarnings and trying to fix up any that I could.
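One generic way to surface such deprecations locally is to promote `FutureWarning` to an error so each offending call site fails loudly (a minimal sketch, not dask-specific):

```python
import warnings

# Escalate FutureWarning to an exception inside this block so deprecated
# usage raises immediately and points at the offending call site.
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)
    try:
        warnings.warn("deprecated API", FutureWarning)
        raised = False
    except FutureWarning:
        raised = True

print(raised)  # True: the warning was escalated to an exception
```

pytest offers the same behavior via `-W error::FutureWarning` on the command line.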

@jorisvandenbossche
Member Author

The following failure is due to pandas-dev/pandas#28507: comparing a tz-naive and a tz-aware timestamp no longer raises an error but returns False:

 ___________________________ test_set_index_timezone ____________________________

    def test_set_index_timezone():
        s_naive = pd.Series(pd.date_range("20130101", periods=3))
        s_aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))
        df = pd.DataFrame({"tz": s_aware, "notz": s_naive})
        d = dd.from_pandas(df, 2)
    
        d1 = d.set_index("notz", npartitions=1)
        s1 = pd.DatetimeIndex(s_naive.values, dtype=s_naive.dtype)
        assert d1.divisions[0] == s_naive[0] == s1[0]
        assert d1.divisions[-1] == s_naive[2] == s1[2]
    
        # We currently lose "freq".  Converting data with pandas-defined dtypes
        # to numpy or pure Python can be lossy like this.
        d2 = d.set_index("tz", npartitions=1)
        s2 = pd.DatetimeIndex(s_aware, dtype=s_aware.dtype)
        assert d2.divisions[0] == s2[0]
        assert d2.divisions[-1] == s2[2]
        assert d2.divisions[0].tz == s2[0].tz
        assert d2.divisions[0].tz is not None
        s2badtype = pd.DatetimeIndex(s_aware.values, dtype=s_naive.dtype)
        with pytest.raises(TypeError):
>           d2.divisions[0] == s2badtype[0]
E           Failed: DID NOT RAISE <class 'TypeError'>

dask/dataframe/tests/test_shuffle.py:650: Failed
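The underlying pandas change can be seen directly with plain Timestamps (assuming pandas >= 1.2; only equality changed, ordering comparisons between tz-naive and tz-aware values still raise):

```python
import pandas as pd

naive = pd.Timestamp("2013-01-01")
aware = pd.Timestamp("2013-01-01", tz="US/Eastern")

# pandas < 1.2 raised TypeError for this comparison;
# since pandas 1.2 (pandas-dev/pandas#28507) it returns False.
result = naive == aware
print(result)  # False
```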

@jorisvandenbossche
Member Author

For the failing dask/dataframe/tests/test_arithmetics_reduction.py::test_reductions_non_numeric_dtypes, that's because std() is now implemented for datetime dtype.

@jsignell
Member

jsignell commented Dec 22, 2020

For the failing dask/dataframe/tests/test_arithmetics_reduction.py::test_reductions_non_numeric_dtypes, that's because std() is now implemented for datetime dtype.

It seems reasonable to me to just skip that test for this case until someone gets around to implementing it. The dask implementation depends on var, which pandas doesn't yet have for datetime.

diff --git a/dask/dataframe/tests/test_arithmetics_reduction.py b/dask/dataframe/tests/test_arithmetics_reduction.py
index c04d3c07..2758de23 100644
--- a/dask/dataframe/tests/test_arithmetics_reduction.py
+++ b/dask/dataframe/tests/test_arithmetics_reduction.py
@@ -6,7 +6,7 @@ import numpy as np
 import pandas as pd
 
 import dask.dataframe as dd
-from dask.dataframe._compat import PANDAS_GT_100, PANDAS_VERSION
+from dask.dataframe._compat import PANDAS_GT_100, PANDAS_GT_120, PANDAS_VERSION
 from dask.dataframe.utils import (
     assert_eq,
     assert_dask_graph,
@@ -1002,7 +1002,12 @@ def test_reductions_non_numeric_dtypes():
         assert_eq(dds.min(), pds.min())
         assert_eq(dds.max(), pds.max())
         assert_eq(dds.count(), pds.count())
-        check_raises(dds, pds, "std")
+        if PANDAS_GT_120 and pds.dtype == "datetime64[ns]":
+            # std is implemented for datetimes in pandas 1.2.0, but dask
+            # implementation depends on var which isn't
+            pass
+        else:
+            check_raises(dds, pds, "std")
         check_raises(dds, pds, "var")
         check_raises(dds, pds, "sem")
         check_raises(dds, pds, "skew")
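The pandas 1.2 behavior that this skip accounts for, shown with a plain Series (assuming pandas >= 1.2):

```python
import pandas as pd

s = pd.Series(pd.date_range("2021-01-01", periods=3))

# Since pandas 1.2, std() is defined for datetime64 data and
# returns a Timedelta ...
print(s.std())

# ... but var() is still unsupported (a variance of datetimes would
# have squared-time units), so dask's var-based std cannot be reused.
try:
    s.var()
    var_supported = True
except TypeError:
    var_supported = False
print(var_supported)  # False
```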

@jorisvandenbossche
Member Author

It seems reasonable to me to just skip that test for this case until someone gets around to implementing it. The dask implementation depends on var, which pandas doesn't yet have for datetime.

Indeed, that sounds best. We should open an issue for that on the pandas side.

The parquet failures are a bit strange: I can reproduce them locally in an environment with pandas 1.2.0rc, but I also get the same failure with the latest stable pandas (while here on CI there is no such failure). So there might be an interaction with another library version in play; I don't immediately see which one.

@jsignell
Member

The parquet failures are a bit strange

I have been seeing the parquet failures locally for a while. I can't tell if it is related to the pandas version or not. I don't think dask has been doing any tests against pandas > 1.0.* except for the upstream ones, which have been failing for a while: #6148

@jorisvandenbossche
Member Author

Ah, indeed, it's already failing on pandas 1.1, but passing on pandas 1.0. So basically writing a partitioned parquet dataset where the partition column is categorical dtype is completely broken.

I don't think dask has been doing any tests against pandas > 1.0.*

Ah, pandas 1.1 is half a year old, so we probably should have been testing it already.

@jorisvandenbossche
Member Author

I opened pandas-dev/pandas#38642 for this on the pandas side.

@jsignell
Member

So basically writing a partitioned parquet dataset where the partition column is categorical dtype is completely broken.

Maybe we should xfail those tests for now while the conversation goes on.

@jsignell
Member

I think we are mostly down to the ufunc tests. It sounds like you have in mind how to fix them?

@jsignell
Member

Ah no, there are failures on https://github.com/dask/dask/pull/6996/checks?check_run_id=1597495686 now as well. Maybe the sparse skip in 90de1b4 should be more nuanced?

).compute()
out["lon"] = out.lon.astype("int") # just to pass assert
# convert categorical to plain int just to pass assert
out["lon"] = out.lon.astype(df0.lon.dtype)
Member Author

There was a int64 vs int32 issue on windows in the 3.8 env with pandas 1.2.0 (https://github.com/dask/dask/pull/6996/checks?check_run_id=1681175945#step:5:249). Not directly sure how that would be caused by pandas, though, or why it started failing now.
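The platform dependence behind that failure: on NumPy 1.x, `"int"` resolves to the C `long`, which is 32-bit on 64-bit Windows but 64-bit on Linux/macOS, so an explicit width sidesteps it (a general numpy sketch):

```python
import numpy as np

# On NumPy 1.x, "int" maps to the platform's C long: 32-bit on
# 64-bit Windows, 64-bit on Linux/macOS, so results can differ
# between CI runners.
platform_int = np.dtype("int")
print(platform_int.itemsize)  # 4 on Windows (NumPy 1.x), 8 elsewhere

# An explicit width is the same everywhere.
fixed_int = np.dtype("int64")
print(fixed_int.itemsize)  # 8 bytes on every platform
```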

Member Author

Hmm, still failing with this change.

prolling = df.a.rolling(window, center=center)
drolling = ddf.a.rolling(window, center=center)
prolling = df.a.rolling(window, center=center, min_periods=min_periods)
drolling = ddf.a.rolling(window, center=center, min_periods=min_periods)
Member Author

It seems that this test now fails once in a few runs, see eg https://github.com/dask/dask/pull/6996/checks?check_run_id=1681392897#step:5:191

@jsignell
Member

Is there anything I can do to help with this?

@QuLogic
Contributor

QuLogic commented Jan 19, 2021

I've been able to successfully apply this to the Fedora dask package for testing in Rawhide, and it appears to pass tests with Pandas 1.2.0 fine. Other than a rebase/merge, not sure if anything else needs to be done here.

@mrocklin
Member

cc @crusaderky

@jorisvandenbossche
Member Author

Merged master to resolve conflicts.

I think the last remaining item for this PR is #6996 (comment): the rolling tests are failing once in every few runs (a floating point precision issue, setting the tolerance in the assert to a higher value might solve it, but it's still potentially worth investigating why it fails with pandas 1.2 and not with other versions)

In addition, there are a few follow-ups required (things for which I only added a workaround or skip in this PR), but will list those in a new issue.

@jorisvandenbossche
Member Author

See eg the failure in the last builds: https://github.com/dask/dask/pull/6996/checks?check_run_id=1730329886#step:5:192 (here it failed in the Mac build, but it's not Mac-specific, it failed before on Linux as well)

@jsignell
Member

setting the tolerance in the assert to a higher value might solve it

There are other places where we have upped the tolerance for specific pandas versions, e.g. https://github.com/dask/dask/blob/72304a94c98ace592f01df91e3d9e89febda307c/dask/dataframe/tests/test_rolling.py#L263:L270; we should probably do something similar here.
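Loosening the comparison can go through the `rtol`/`atol` keywords that pandas' own testing helpers accept since pandas 1.1 (a minimal sketch, assuming the tolerance keywords reach the underlying `assert_series_equal`):

```python
import pandas as pd
import pandas.testing as tm

a = pd.Series([1.0, 2.0 + 1e-9])
b = pd.Series([1.0, 2.0])

# The 1e-9 drift mimics floating point accumulation in rolling sums.
# An absolute tolerance accepts it; no AssertionError is raised.
tm.assert_series_equal(a, b, atol=1e-6, rtol=0)
print("ok")
```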

@jsignell
Member

I'm going to up the tolerance on the rolling precision tests so that we can hopefully get this in before the release on Friday.

@jsignell
Member

I think we should change back "run upstream every time" and merge this.

@jorisvandenbossche
Member Author

Thanks for those last updates! I think it's indeed good to be merged then.
(I will create an issue with the follow-up tasks identified here)

# nightly numpy/pandas again
conda update -y -c arrow-nightlies pyarrow

conda uninstall --force pandas
Member

I am just leaving this as pandas for now since #6896 and #7084 aren't in yet.

@jsignell
Member

Hmm, the special commit message for test-upstream doesn't seem to work as expected. That isn't this PR's job though. Will merge when green.

@jsignell
Member

I pushed an empty commit to try to get the tests to pass.
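For reference, an empty commit like that can be created as follows (demonstrated in a throwaway repository; the message text is illustrative):

```shell
# An empty commit records no file changes but still creates a new
# commit, so pushing it retriggers the CI checks on a real branch.
cd "$(mktemp -d)"
git init -q
git -c user.email=you@example.com -c user.name=you \
    commit --allow-empty -m "Empty commit to trigger CI" -q
git rev-list --count HEAD   # 1
```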

@jsignell jsignell merged commit 4395973 into dask:master Jan 21, 2021
@jsignell
Member

@jorisvandenbossche 👏 👏 merged!

@jorisvandenbossche
Member Author

And opened the follow-up issue now listing all outstanding issues to resolve: #7100
