
Test pandas 1.1.x / 1.2.0 releases and pandas nightly#6996

Merged
jsignell merged 44 commits into dask:master from jorisvandenbossche:test-pandas-1.2
Jan 21, 2021

Conversation

@jorisvandenbossche
Member

No description provided.

@TomAugspurger
Member

Thanks for starting this, I've been a bit busy lately :)

@jsignell
Member

I fixed some of the obvious ones. Mind if I push?

@jorisvandenbossche
Member Author

Great, feel free to push!

Some notes about the failures I already investigated up to now:

@jsignell
Member

I was just reading the FutureWarnings and trying to fix up any that I could.
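One generic way to surface such deprecations locally is to promote `FutureWarning` to an error so each offending call site fails loudly (a minimal sketch, not dask-specific):

```python
import warnings

# Escalate FutureWarning to an exception inside this block so deprecated
# usage raises immediately and points at the offending call site.
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)
    try:
        warnings.warn("deprecated API", FutureWarning)
        raised = False
    except FutureWarning:
        raised = True

print(raised)  # True: the warning was escalated to an exception
```

pytest offers the same behavior via `-W error::FutureWarning` on the command line.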

@jorisvandenbossche
Member Author

The following failure is due to pandas-dev/pandas#28507: comparing a tz-naive and a tz-aware timestamp no longer raises an error but returns False:

 ___________________________ test_set_index_timezone ____________________________

    def test_set_index_timezone():
        s_naive = pd.Series(pd.date_range("20130101", periods=3))
        s_aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))
        df = pd.DataFrame({"tz": s_aware, "notz": s_naive})
        d = dd.from_pandas(df, 2)
    
        d1 = d.set_index("notz", npartitions=1)
        s1 = pd.DatetimeIndex(s_naive.values, dtype=s_naive.dtype)
        assert d1.divisions[0] == s_naive[0] == s1[0]
        assert d1.divisions[-1] == s_naive[2] == s1[2]
    
        # We currently lose "freq".  Converting data with pandas-defined dtypes
        # to numpy or pure Python can be lossy like this.
        d2 = d.set_index("tz", npartitions=1)
        s2 = pd.DatetimeIndex(s_aware, dtype=s_aware.dtype)
        assert d2.divisions[0] == s2[0]
        assert d2.divisions[-1] == s2[2]
        assert d2.divisions[0].tz == s2[0].tz
        assert d2.divisions[0].tz is not None
        s2badtype = pd.DatetimeIndex(s_aware.values, dtype=s_naive.dtype)
        with pytest.raises(TypeError):
>           d2.divisions[0] == s2badtype[0]
E           Failed: DID NOT RAISE <class 'TypeError'>

dask/dataframe/tests/test_shuffle.py:650: Failed
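The underlying pandas change can be seen directly with plain Timestamps (assuming pandas >= 1.2; only equality changed, ordering comparisons between tz-naive and tz-aware values still raise):

```python
import pandas as pd

naive = pd.Timestamp("2013-01-01")
aware = pd.Timestamp("2013-01-01", tz="US/Eastern")

# pandas < 1.2 raised TypeError for this comparison;
# since pandas 1.2 (pandas-dev/pandas#28507) it returns False.
result = naive == aware
print(result)  # False
```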

@jorisvandenbossche
Member Author

For the failing dask/dataframe/tests/test_arithmetics_reduction.py::test_reductions_non_numeric_dtypes, that's because std() is now implemented for datetime dtype.

@jsignell
Member

jsignell commented Dec 22, 2020

For the failing dask/dataframe/tests/test_arithmetics_reduction.py::test_reductions_non_numeric_dtypes, that's because std() is now implemented for datetime dtype.

It seems reasonable to me to just skip that test for this case until someone gets around to implementing it. The dask implementation depends on var, which pandas doesn't yet have for datetime.

diff --git a/dask/dataframe/tests/test_arithmetics_reduction.py b/dask/dataframe/tests/test_arithmetics_reduction.py
index c04d3c07..2758de23 100644
--- a/dask/dataframe/tests/test_arithmetics_reduction.py
+++ b/dask/dataframe/tests/test_arithmetics_reduction.py
@@ -6,7 +6,7 @@ import numpy as np
 import pandas as pd
 
 import dask.dataframe as dd
-from dask.dataframe._compat import PANDAS_GT_100, PANDAS_VERSION
+from dask.dataframe._compat import PANDAS_GT_100, PANDAS_GT_120, PANDAS_VERSION
 from dask.dataframe.utils import (
     assert_eq,
     assert_dask_graph,
@@ -1002,7 +1002,12 @@ def test_reductions_non_numeric_dtypes():
         assert_eq(dds.min(), pds.min())
         assert_eq(dds.max(), pds.max())
         assert_eq(dds.count(), pds.count())
-        check_raises(dds, pds, "std")
+        if PANDAS_GT_120 and pds.dtype == "datetime64[ns]":
+            # std is implemented for datetimes in pandas 1.2.0, but dask
+            # implementation depends on var which isn't
+            pass
+        else:
+            check_raises(dds, pds, "std")
         check_raises(dds, pds, "var")
         check_raises(dds, pds, "sem")
         check_raises(dds, pds, "skew")
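The pandas 1.2 behavior that this skip accounts for, shown with a plain Series (assuming pandas >= 1.2):

```python
import pandas as pd

s = pd.Series(pd.date_range("2021-01-01", periods=3))

# Since pandas 1.2, std() is defined for datetime64 data and
# returns a Timedelta ...
print(s.std())

# ... but var() is still unsupported (a variance of datetimes would
# have squared-time units), so dask's var-based std cannot be reused.
try:
    s.var()
    var_supported = True
except TypeError:
    var_supported = False
print(var_supported)  # False
```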

@jorisvandenbossche
Member Author

It seems reasonable to me to just skip that test for this case until someone gets around to implementing it. The dask implementation depends on var, which pandas doesn't yet have for datetime.

Indeed, that sounds best. We should open an issue for that on the pandas side.

The parquet failures are a bit strange: I can reproduce them locally in an environment with pandas 1.2.0rc, but I also get the same failure with the latest stable pandas (while here on CI there is no such failure). So there might be an interaction with another library version in play; I don't immediately see which one.

@jsignell
Member

The parquet failures are a bit strange

I have been seeing the parquet failures locally for a while. I can't tell if it is related to the pandas version or not. I don't think dask has been doing any tests against pandas > 1.0.* except for the upstream ones, which have been failing for a while: #6148

@jorisvandenbossche
Member Author

Ah, indeed, it's already failing on pandas 1.1, but passing on pandas 1.0. So basically writing a partitioned parquet dataset where the partition column is categorical dtype is completely broken.

I don't think dask has been doing any tests against pandas > 1.0.*

Ah, pandas 1.1 is half a year old, so we probably should have been testing it already.

@jorisvandenbossche
Member Author

I opened pandas-dev/pandas#38642 for this on the pandas side.

@jsignell
Member

So basically writing a partitioned parquet dataset where the partition column is categorical dtype is completely broken.

Maybe we should xfail those tests for now while the conversation goes on.

@jsignell
Member

I think we are mostly down to the ufunc tests. It sounds like you have in mind how to fix them?

@jsignell
Member

Ah no, there are failures on https://github.com/dask/dask/pull/6996/checks?check_run_id=1597495686 now as well. Maybe the sparse skip in 90de1b4 should be more nuanced?

).compute()
out["lon"] = out.lon.astype("int") # just to pass assert
# convert categorical to plain int just to pass assert
out["lon"] = out.lon.astype(df0.lon.dtype)
Member Author

There was a int64 vs int32 issue on windows in the 3.8 env with pandas 1.2.0 (https://github.com/dask/dask/pull/6996/checks?check_run_id=1681175945#step:5:249). Not directly sure how that would be caused by pandas, though, or why it started failing now.
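The platform dependence behind that failure: on NumPy 1.x, `"int"` resolves to the C `long`, which is 32-bit on 64-bit Windows but 64-bit on Linux/macOS, so an explicit width sidesteps it (a general numpy sketch):

```python
import numpy as np

# On NumPy 1.x, "int" maps to the platform's C long: 32-bit on
# 64-bit Windows, 64-bit on Linux/macOS, so results can differ
# between CI runners.
platform_int = np.dtype("int")
print(platform_int.itemsize)  # 4 on Windows (NumPy 1.x), 8 elsewhere

# An explicit width is the same everywhere.
fixed_int = np.dtype("int64")
print(fixed_int.itemsize)  # 8 bytes on every platform
```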

Member Author

Hmm, still failing with this change.

prolling = df.a.rolling(window, center=center)
drolling = ddf.a.rolling(window, center=center)
prolling = df.a.rolling(window, center=center, min_periods=min_periods)
drolling = ddf.a.rolling(window, center=center, min_periods=min_periods)
Member Author

It seems that this test now fails once in a few runs, see eg https://github.com/dask/dask/pull/6996/checks?check_run_id=1681392897#step:5:191

@jsignell
Member

Is there anything I can do to help with this?

@QuLogic
Contributor

QuLogic commented Jan 19, 2021

I've been able to successfully apply this to the Fedora dask package for testing in Rawhide, and it appears to pass tests with Pandas 1.2.0 fine. Other than a rebase/merge, not sure if anything else needs to be done here.

@mrocklin
Member

cc @crusaderky

@jorisvandenbossche
Member Author

Merged master to resolve conflicts.

I think the last remaining item for this PR is #6996 (comment): the rolling tests are failing once in every few runs (a floating point precision issue, setting the tolerance in the assert to a higher value might solve it, but it's still potentially worth investigating why it fails with pandas 1.2 and not with other versions)

In addition, there are a few follow-ups required (things for which I only added a workaround or skip in this PR), but will list those in a new issue.

@jorisvandenbossche
Member Author

See eg the failure in the last builds: https://github.com/dask/dask/pull/6996/checks?check_run_id=1730329886#step:5:192 (here it failed in the Mac build, but it's not Mac-specific, it failed before on Linux as well)

@jsignell
Member

setting the tolerance in the assert to a higher value might solve it

There are other places where we have upped the tolerance for specific pandas versions, e.g. https://github.com/dask/dask/blob/72304a94c98ace592f01df91e3d9e89febda307c/dask/dataframe/tests/test_rolling.py#L263:L270; we should probably do something similar here.
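Loosening the comparison can go through the `rtol`/`atol` keywords that pandas' own testing helpers accept since pandas 1.1 (a minimal sketch, assuming the tolerance keywords reach the underlying `assert_series_equal`):

```python
import pandas as pd
import pandas.testing as tm

a = pd.Series([1.0, 2.0 + 1e-9])
b = pd.Series([1.0, 2.0])

# The 1e-9 drift mimics floating point accumulation in rolling sums.
# An absolute tolerance accepts it; no AssertionError is raised.
tm.assert_series_equal(a, b, atol=1e-6, rtol=0)
print("ok")
```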

@jsignell
Member

I'm going to up the tolerance on the rolling precision tests so that we can hopefully get this in before the release on Friday.

@jsignell
Member

I think we should change back "run upstream every time" and merge this.

@jorisvandenbossche
Member Author

Thanks for those last updates! I think it's indeed good to be merged then.
(I will create an issue with the follow-up tasks identified here)

# nightly numpy/pandas again
conda update -y -c arrow-nightlies pyarrow

conda uninstall --force pandas
Member

I am just leaving this as pandas for now since #6896 and #7084 aren't in yet.

@jsignell
Member

Hmm, the special commit message for test-upstream doesn't seem to work as expected. That isn't this PR's job though. Will merge when green.

@jsignell
Member

I pushed an empty commit to try to get the tests to pass.
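For reference, an empty commit like that can be created as follows (demonstrated in a throwaway repository; the message text is illustrative):

```shell
# An empty commit records no file changes but still creates a new
# commit, so pushing it retriggers the CI checks on a real branch.
cd "$(mktemp -d)"
git init -q
git -c user.email=you@example.com -c user.name=you \
    commit --allow-empty -m "Empty commit to trigger CI" -q
git rev-list --count HEAD   # 1
```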

@jsignell jsignell merged commit 4395973 into dask:master Jan 21, 2021
@jsignell
Member

@jorisvandenbossche 👏 👏 merged!

@jorisvandenbossche
Member Author

And opened the follow-up issue now listing all outstanding issues to resolve: #7100
