Skip to content

Support cudf as a DataFrame backend#212

Merged
phofl merged 27 commits intodask:mainfrom
rjzamora:cudf-backend
Jul 25, 2023
Merged

Support cudf as a DataFrame backend#212
phofl merged 27 commits intodask:mainfrom
rjzamora:cudf-backend

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Jul 5, 2023

  • Includes the necessary changes for cudf to be used as an alternative to pandas as the DataFrame backend
  • Uses the DASK_DATAFRAME__BACKEND environment variable (with default of "pandas") to set the backend library in test_io.py and test_collection.py. This approach can be expanded incrementally throughout the test suite without changing much code. I added skip/xfail statements for cudf in a few places, but we don't need to do this for all tests that fail with cudf (it is valuable to be able to "run" tests with a cudf backend either way).
  • Does not work with latest release of cudf, nor the cudf nightly, because neither of these cudf versions support pandas>=2

@mrocklin
Copy link
Member

mrocklin commented Jul 6, 2023

Oh cool. I'm curious to see what comes here.

My sense (correct me if I'm wrong) is that we probably don't want to annotate every test in order to support GPUs. I'm thinking that this is just a temporary thing as you're playing around. If not then let's chat.

@rjzamora
Copy link
Member Author

rjzamora commented Jul 6, 2023

My sense (correct me if I'm wrong) is that we probably don't want to annotate every test in order to support GPUs. I'm thinking that this is just a temporary thing as you're playing around. If not then let's chat.

Right, I'd rather not annotate everything, and I'm definitely still exploring here.

It isn't difficult to parameterize the df fixture to use both pandas and cudf (when available). However, it turns out that many tests fail for cudf. In some cases, the failures are from bugs (e.g. rapidsai/cudf#13666). In other cases it's because of limits in cudf's API coverage. For example, explode doesn't work because we assume that the backend supports a list of columns (and cudf doesn't), and idxmin/max doesn't work, because cudf only has this operation implemented for groupby.

At the moment, I'm thinking that I should just configure pdf/df with something like:

@pytest.fixture(
    params=[
        "pandas",
        pytest.param("cudf", marks=pytest.mark.skipif(cudf is None, reason="cudf not found.")),
    ]
)
def backend(request):
    yield request.param

I'm thinking this would allow us to skip (maybe xfail?) specific tests for the "cudf" backend instead of annotating every test. That said, I'm still feeling open to completely different approaches. I'm certainly preoccupied with finding/fixing the actual bugs that I'm finding.

@rjzamora
Copy link
Member Author

Update:

@rjzamora rjzamora changed the title Support and test the cudf backend Support cudf as a DataFrame backend Jul 18, 2023
@rjzamora rjzamora marked this pull request as ready for review July 18, 2023 13:36
Comment on lines +14 to +16
# Import backend DataFrame library to test
BACKEND = config.get("dataframe.backend", "pandas")
lib = importlib.import_module(BACKEND)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl - Any opinions on using this approach as a "switch" to test the cudf backend?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly interested in your thoughts as well @mrocklin

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok, although it will take me some time to get used to it😂

traveling today, will take a closer look at the whole pr when I am back

Copy link
Collaborator

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments

Comment on lines +57 to +68
if self._required_attribute:
dep = next(iter(self.dependencies()))._meta
if not hasattr(dep, self._required_attribute):
# Raise a ValueError instead of AttributeError to
# avoid infinite recursion
raise ValueError(f"{dep} has no attribute {self._required_attribute}")

@property
def _required_attribute(self) -> str:
# Specify if the first `dependency` must support
# a specific attribute for valid behavior.
return None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl - I ran into quite a few cases where the cudf backend was hanging (rather than quickly failing) for cases where a cudf.DataFrame/Series did not support the same attribute as pd.DataFrame/Series (e.g. nbytes, align, etc).

What do you think about doing something like this so that we can more-quickly detect that the dependency is missing a necessary attribute?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes I like this.

@phofl phofl merged commit 20cf274 into dask:main Jul 25, 2023
@phofl
Copy link
Collaborator

phofl commented Jul 25, 2023

thx!

@rjzamora
Copy link
Member Author

woohoo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants