Skip to content

Support new axis=None behavior in pandas 2.0 for certain reductions#9867

Merged
jrbourbeau merged 10 commits intodask:mainfrom
jrbourbeau:test_ufunc_with_reduction-fixup
Feb 6, 2023
Merged

Support new axis=None behavior in pandas 2.0 for certain reductions#9867
jrbourbeau merged 10 commits intodask:mainfrom
jrbourbeau:test_ufunc_with_reduction-fixup

Conversation

@jrbourbeau
Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau commented Jan 23, 2023

In pandas=2.0 certain DataFrame reductions (e.g. DataFrame.min(), DataFrame.max()) will now be performed over both axes and a scalar returned when axis=None. This PR implements the same thing here.

xref pandas-dev/pandas#50593

  • DataFrame.min
  • DataFrame.max
  • DataFrame.mean
  • DataFrame.median
  • DataFrame.skew
  • DataFrame.kurt

EDIT: Note that this change is currently showing up in the test suite as dask/dataframe/tests/test_ufunc.py::test_ufunc_with_reduction failing with tracebacks like the following

Details:
_________________ test_ufunc_with_reduction[pandas1-conj-min] __________________
[gw3] linux -- Python 3.10.8 /usr/share/miniconda3/envs/test-environment/bin/python3.10

redfunc = 'min', ufunc = 'conj'
pandas =      A   B         C
0   61  72  0.514242
1   82   8  0.197730
2   60  72  0.433253
3   55  78  0.117509
4   94  51  1...  96  0.482159
15  14  40  0.622287
16  41  70  1.435984
17  14  13  0.525581
18  83  79  1.510639
19  35  70  1.899436

    @pytest.mark.parametrize("redfunc", ["sum", "prod", "min", "max", "mean"])
    @pytest.mark.parametrize("ufunc", _BASE_UFUNCS)
    @pytest.mark.parametrize(
        "pandas",
        [
            pd.Series(np.abs(np.random.randn(100))),
            pd.DataFrame(
                {
                    "A": np.random.randint(1, 100, size=20),
                    "B": np.random.randint(1, 100, size=20),
                    "C": np.abs(np.random.randn(20)),
                }
            ),
        ],
    )
    def test_ufunc_with_reduction(redfunc, ufunc, pandas):
        dask = dd.from_pandas(pandas, 3)
    
        np_redfunc = getattr(np, redfunc)
        np_ufunc = getattr(np, ufunc)
    
        if (
            PANDAS_GT_120
            and (redfunc == "prod")
            and ufunc in ["conj", "square", "negative", "absolute"]
            and isinstance(pandas, pd.DataFrame)
        ):
            # TODO(pandas) follow pandas behaviour?
            # starting with pandas 1.2.0, the ufunc is applied column-wise, and therefore
            # applied on the integer columns separately, overflowing for those columns
            # (instead of being applied on 2D ndarray that was converted to float)
            pytest.xfail("'prod' overflowing with integer columns in pandas 1.2.0")
    
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", RuntimeWarning)
            warnings.simplefilter("ignore", FutureWarning)
            assert isinstance(np_redfunc(dask), (dd.DataFrame, dd.Series, dd.core.Scalar))
>           assert_eq(np_redfunc(np_ufunc(dask)), np_redfunc(np_ufunc(pandas)))

dask/dataframe/tests/test_ufunc.py:535: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dask/dataframe/utils.py:565: in assert_eq
    b = _maybe_sort(b, check_index)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = 0.019972785787842888, check_index = True

    def _maybe_sort(a, check_index: bool):
        # sort by value, then index
        try:
            if is_dataframe_like(a):
                if set(a.index.names) & set(a.columns):
                    a.index.names = [
                        "-overlapped-index-name-%d" % i for i in range(len(a.index.names))
                    ]
                a = a.sort_values(by=methods.tolist(a.columns))
            else:
>               a = a.sort_values()
E               AttributeError: 'numpy.float64' object has no attribute 'sort_values'

dask/dataframe/utils.py:527: AttributeError

@jrbourbeau jrbourbeau changed the title [WIP] Support new axis=None behavior in pandas 2.0 for certain reductions Support new axis=None behavior in pandas 2.0 for certain reductions Jan 31, 2023
@jrbourbeau
Copy link
Copy Markdown
Member Author

Alright, I've decided to punt on skew and kurtosis for the time being. So this PR just handles min, max, and mean. I'll open up an issue for skew and kurtosis so we don't loose track of them. @j-bennet I think this is ready for review whenever you get a chance

@j-bennet
Copy link
Copy Markdown
Contributor

j-bennet commented Feb 1, 2023

@jrbourbeau

Alright, I've decided to punt on skew and kurtosis for the time being.

and median too, right?

@jrbourbeau
Copy link
Copy Markdown
Member Author

jrbourbeau commented Feb 1, 2023

We don't actually have a median method that supports axis= today. If we add axis= support to median then we'd definitely want to account for the new axis=None behavior. But since we don't currently have that, I'm thinking about it as its own separate thing

Copy link
Copy Markdown
Contributor

@j-bennet j-bennet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left some minor comments.

enforce_metadata=False,
parent_meta=self._meta,
)
if isinstance(result, DataFrame):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _reduction_agg, you do this for both DataFrame and Series, but here just DataFrame. Is this enough?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be okay as the axis=None case (where a Scalar is returned, not a Series) is handled in the if-block above. _reduction_agg is handling both cases, so needs to be a bit more flexible. That said, I included a commit that should make the logic (hopefully) easier to reason about

Further, this method currently does not support filtering out NaN
values, which is again a difference to Pandas.
"""
if PANDAS_GT_200 and axis is None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be a decorator, since you do this in two places.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Though in this particular case I think the cognitive load on readers is actually smaller if we just duplicate this simple logic twice. If the logic was much more involved, or if it was used in lots of places, totally agree using a decorator (or some other isolated utility) would be a good practice.

assert_eq(dds.skew(), pds.skew() / bias_factor)

if PANDAS_GT_200:
# TODO: Remove this `if`-block once `axis=None` support is added
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be helpful to open an issue and mention the issue here, so the person implementing it could grep the codebase for issue number.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Copy Markdown
Contributor

@j-bennet j-bennet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

@jrbourbeau
Copy link
Copy Markdown
Member Author

Thanks for reviewing @j-bennet

@jrbourbeau jrbourbeau merged commit 312009f into dask:main Feb 6, 2023
@jrbourbeau jrbourbeau deleted the test_ufunc_with_reduction-fixup branch February 6, 2023 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants