Support new `axis=None` behavior in `pandas` 2.0 for certain reductions by jrbourbeau · Pull Request #9867 · dask/dask

jrbourbeau · 2023-01-23T22:36:00Z

In pandas 2.0, certain DataFrame reductions (e.g. DataFrame.min(), DataFrame.max()) are performed over both axes and return a scalar when axis=None. This PR implements the same behavior in dask.

xref pandas-dev/pandas#50593
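
As a rough illustration of the pandas behavior this PR mirrors, the sketch below contrasts a column-wise reduction with a reduction over both axes. The frame and variable names are made up for the example, and the `axis=None` call sits behind a simple major-version check (an assumption about how to detect pandas 2.0) so the snippet also runs on older pandas.

```python
import pandas as pd

# Illustrative data only; not taken from the PR.
df = pd.DataFrame({"A": [3, 1, 2], "B": [6.0, 5.0, 4.0]})

# Column-wise reduction: one minimum per column, returned as a Series.
per_column = df.min(axis=0)

# Reduction over both axes, spelled out explicitly so it works on any pandas version.
overall = df.to_numpy().min()

# Simple major-version check (an assumption made for this sketch).
PANDAS_2_OR_LATER = int(pd.__version__.split(".")[0]) >= 2

if PANDAS_2_OR_LATER:
    # With pandas 2.0+, axis=None reduces over both axes and yields a scalar.
    assert df.min(axis=None) == overall

print(per_column)
print(overall)
```

The point of the comparison is the return type: the column-wise call gives a Series, while the both-axes reduction gives a single value.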

EDIT: Note that this change currently shows up in the test suite as dask/dataframe/tests/test_ufunc.py::test_ufunc_with_reduction failing with tracebacks like the following:

Details:

_________________ test_ufunc_with_reduction[pandas1-conj-min] __________________
[gw3] linux -- Python 3.10.8 /usr/share/miniconda3/envs/test-environment/bin/python3.10

redfunc = 'min', ufunc = 'conj'
pandas =      A   B         C
0   61  72  0.514242
1   82   8  0.197730
2   60  72  0.433253
3   55  78  0.117509
4   94  51  1...  96  0.482159
15  14  40  0.622287
16  41  70  1.435984
17  14  13  0.525581
18  83  79  1.510639
19  35  70  1.899436

    @pytest.mark.parametrize("redfunc", ["sum", "prod", "min", "max", "mean"])
    @pytest.mark.parametrize("ufunc", _BASE_UFUNCS)
    @pytest.mark.parametrize(
        "pandas",
        [
            pd.Series(np.abs(np.random.randn(100))),
            pd.DataFrame(
                {
                    "A": np.random.randint(1, 100, size=20),
                    "B": np.random.randint(1, 100, size=20),
                    "C": np.abs(np.random.randn(20)),
                }
            ),
        ],
    )
    def test_ufunc_with_reduction(redfunc, ufunc, pandas):
        dask = dd.from_pandas(pandas, 3)
    
        np_redfunc = getattr(np, redfunc)
        np_ufunc = getattr(np, ufunc)
    
        if (
            PANDAS_GT_120
            and (redfunc == "prod")
            and ufunc in ["conj", "square", "negative", "absolute"]
            and isinstance(pandas, pd.DataFrame)
        ):
            # TODO(pandas) follow pandas behaviour?
            # starting with pandas 1.2.0, the ufunc is applied column-wise, and therefore
            # applied on the integer columns separately, overflowing for those columns
            # (instead of being applied on 2D ndarray that was converted to float)
            pytest.xfail("'prod' overflowing with integer columns in pandas 1.2.0")
    
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", RuntimeWarning)
            warnings.simplefilter("ignore", FutureWarning)
            assert isinstance(np_redfunc(dask), (dd.DataFrame, dd.Series, dd.core.Scalar))
>           assert_eq(np_redfunc(np_ufunc(dask)), np_redfunc(np_ufunc(pandas)))

dask/dataframe/tests/test_ufunc.py:535: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dask/dataframe/utils.py:565: in assert_eq
    b = _maybe_sort(b, check_index)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = 0.019972785787842888, check_index = True

    def _maybe_sort(a, check_index: bool):
        # sort by value, then index
        try:
            if is_dataframe_like(a):
                if set(a.index.names) & set(a.columns):
                    a.index.names = [
                        "-overlapped-index-name-%d" % i for i in range(len(a.index.names))
                    ]
                a = a.sort_values(by=methods.tolist(a.columns))
            else:
>               a = a.sort_values()
E               AttributeError: 'numpy.float64' object has no attribute 'sort_values'

dask/dataframe/utils.py:527: AttributeError
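
The scalar seen in the traceback is consistent with NumPy forwarding its default `axis=None` to the object's own reduction method. A minimal sketch, assuming NumPy's usual behavior of delegating `np.min` to a non-ndarray argument's `min` method; the `AxisRecorder` class is hypothetical and exists only to show which `axis` value arrives:

```python
import numpy as np


class AxisRecorder:
    """Hypothetical stand-in for a DataFrame-like object (not a pandas class)."""

    def __init__(self):
        self.seen_axis = "not called"

    def min(self, axis=0, out=None, **kwargs):
        # Remember the axis NumPy passed in, then return a NumPy scalar.
        self.seen_axis = axis
        return np.float64(0.5)


recorder = AxisRecorder()
result = np.min(recorder)

print(recorder.seen_axis)               # the axis value that np.min forwarded
print(hasattr(result, "sort_values"))   # a NumPy scalar has no sort_values method
```

Under that assumption, once `DataFrame.min(axis=None)` returns a scalar, `np.min` on a pandas DataFrame returns one too, which would explain why `_maybe_sort` ends up calling `sort_values` on a `numpy.float64`.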

jrbourbeau · 2023-02-01T20:13:40Z

Alright, I've decided to punt on skew and kurtosis for the time being, so this PR just handles min, max, and mean. I'll open an issue for skew and kurtosis so we don't lose track of them. @j-bennet I think this is ready for review whenever you get a chance.

j-bennet · 2023-02-01T22:52:03Z

@jrbourbeau

> Alright, I've decided to punt on skew and kurtosis for the time being.

and median too, right?

jrbourbeau · 2023-02-01T22:55:37Z

We don't actually have a median method that supports axis= today. If we add axis= support to median then we'd definitely want to account for the new axis=None behavior. But since we don't currently have that, I'm thinking about it as its own separate thing

j-bennet

Looks good! Left some minor comments.

j-bennet · 2023-02-01T22:55:04Z

dask/dataframe/core.py

+                    enforce_metadata=False,
+                    parent_meta=self._meta,
+                )
+                if isinstance(result, DataFrame):


In _reduction_agg, you do this for both DataFrame and Series, but here just DataFrame. Is this enough?

jrbourbeau · Feb 2, 2023

It should be okay, since the axis=None case (where a Scalar is returned, not a Series) is handled in the if-block above. _reduction_agg handles both cases, so it needs to be a bit more flexible. That said, I included a commit that should (hopefully) make the logic easier to reason about.
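
To picture why the axis=None path ends in a single scalar rather than a Series, here is a minimal two-stage sketch for `mean`, assuming plain pandas frames stand in for partitions. The helper names `partition_stats` and `mean_over_both_axes` are hypothetical, and this is not dask's implementation; it only illustrates the general idea of combining per-partition results.

```python
import numpy as np
import pandas as pd


def partition_stats(part):
    """Return (sum, non-null count) over every element of one partition."""
    values = part.to_numpy(dtype=float)
    return np.nansum(values), np.count_nonzero(~np.isnan(values))


def mean_over_both_axes(partitions):
    """Combine per-partition sums and counts into one scalar mean."""
    sums, counts = zip(*(partition_stats(p) for p in partitions))
    return sum(sums) / sum(counts)


# Illustrative data split into two row-wise "partitions".
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10.0, 20.0, 30.0, 40.0]})
partitions = [df.iloc[:3], df.iloc[3:]]

result = mean_over_both_axes(partitions)
print(result)
```

Carrying sums and counts separately matters for `mean`: averaging per-partition means would give the wrong answer when partitions differ in size.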

j-bennet · 2023-02-01T22:56:26Z

dask/dataframe/core.py

           Further, this method currently does not support filtering out NaN
           values, which is again a difference to Pandas.
        """
+        if PANDAS_GT_200 and axis is None:


Could be a decorator, since you do this in two places.

jrbourbeau · Feb 2, 2023

Agreed. Though in this particular case I think the cognitive load on readers is actually smaller if we just duplicate this simple logic twice. If the logic were much more involved, or used in lots of places, I'd totally agree that a decorator (or some other isolated utility) would be good practice.
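
For comparison, the decorator alternative raised in this thread might look roughly like the sketch below. Everything here is hypothetical: the diff excerpt does not show what the guarded branch does, so the sketch assumes it rejects axis=None with an error, and `reject_axis_none`, `Frame`, and the hard-coded flag are invented for illustration.

```python
import functools

# Stand-in for a pandas-version flag; hard-coded here for illustration.
PANDAS_GT_200 = True


def reject_axis_none(method):
    """Hypothetical decorator: refuse axis=None when the flag is set."""

    @functools.wraps(method)
    def wrapper(self, axis=0, *args, **kwargs):
        # Assumed behavior: raise instead of silently reducing over both axes.
        if PANDAS_GT_200 and axis is None:
            raise ValueError(f"axis=None is not supported for {method.__name__}")
        return method(self, axis, *args, **kwargs)

    return wrapper


class Frame:
    """Toy class standing in for a DataFrame-like collection."""

    @reject_axis_none
    def skew(self, axis=0):
        return f"skew over axis={axis}"

    @reject_axis_none
    def kurtosis(self, axis=0):
        return f"kurtosis over axis={axis}"


frame = Frame()
print(frame.skew(axis=0))
```

With only two call sites, the decorator replaces a two-line check with an extra layer of indirection, which is the trade-off described in the reply above.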

j-bennet · 2023-02-01T22:58:54Z

dask/dataframe/tests/test_arithmetics_reduction.py

            assert_eq(dds.skew(), pds.skew() / bias_factor)

+            if PANDAS_GT_200:
+                # TODO: Remove this `if`-block once `axis=None` support is added


Might be helpful to open an issue and mention the issue here, so the person implementing it could grep the codebase for issue number.

j-bennet

This looks good.

jrbourbeau · 2023-02-06T20:34:24Z

Thanks for reviewing @j-bennet

jrbourbeau added 2 commits January 23, 2023 16:30

Support new axis=None behavior in pandas 2.0 for certain reductions

ed644c6

test-upstream

ed315e5

github-actions bot added the dataframe label Jan 23, 2023

jrbourbeau added 3 commits January 31, 2023 13:44

Merge branch 'main' of https://github.com/dask/dask into test_ufunc_w…

d7ad442

…ith_reduction-fixup

Fixup

32c9ea0

test-upstream

5c48100

jrbourbeau changed the title ~~[WIP] Support new axis=None behavior in pandas 2.0 for certain reductions~~ Support new axis=None behavior in pandas 2.0 for certain reductions Jan 31, 2023

Temporarily punt on skew and kurtosis [test-upstream]

4367634

j-bennet reviewed Feb 1, 2023

jrbourbeau mentioned this pull request Feb 2, 2023

axis=None behavior in pandas 2.0 for skew and kurtosis #9915

Open

Add comment

2379b8c

jrbourbeau added the upstream label Feb 2, 2023

jrbourbeau added 3 commits February 2, 2023 15:39

Merge branch 'main' of https://github.com/dask/dask into test_ufunc_w…

28c6eb4

…ith_reduction-fixup

More

3494700

Fix test

5effae7

j-bennet approved these changes Feb 6, 2023

jrbourbeau merged commit 312009f into dask:main Feb 6, 2023

jrbourbeau deleted the test_ufunc_with_reduction-fixup branch February 6, 2023 20:34

jrbourbeau merged 10 commits into dask:main from jrbourbeau:test_ufunc_with_reduction-fixup
