Support cudf as a DataFrame backend by rjzamora · Pull Request #212 · dask/dask-expr

rjzamora · 2023-07-05T20:42:19Z

- Includes the changes needed for cudf to be used as an alternative to pandas as the DataFrame backend.
- Uses the DASK_DATAFRAME__BACKEND environment variable (default "pandas") to set the backend library in test_io.py and test_collection.py. This approach can be expanded incrementally throughout the test suite without changing much code. I added skip/xfail statements for cudf in a few places, but we don't need to do this for every test that fails with cudf (it is valuable to be able to "run" tests with a cudf backend either way).
- Does not work with the latest release of cudf, nor with the cudf nightly, because neither of these versions supports pandas>=2.
  - I am currently working with a variation of the pandas_2.0_feature_branch development branch.
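
For readers unfamiliar with how an environment variable such as DASK_DATAFRAME__BACKEND ends up as the dataframe.backend config value, here is a minimal sketch of the naming convention. It is a simplified imitation written for illustration, not dask's actual implementation, and both helper names (env_to_config_key, backend_from_env) are hypothetical.

```python
import os


def env_to_config_key(name, prefix="DASK_"):
    """Map an environment variable name to a dotted config key.

    Simplified imitation of the convention: drop the prefix,
    lowercase the rest, and turn double underscores into dots.
    """
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    return name[len(prefix):].lower().replace("__", ".")


def backend_from_env(environ=None, default="pandas"):
    """Return the backend name selected via DASK_DATAFRAME__BACKEND."""
    # Accept a mapping so the lookup can be exercised without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    return environ.get("DASK_DATAFRAME__BACKEND", default)
```

Under this convention, setting DASK_DATAFRAME__BACKEND=cudf in the shell before starting the test run selects cudf, and leaving it unset keeps pandas.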

mrocklin · 2023-07-06T19:06:05Z

Oh cool. I'm curious to see what comes here.

My sense (correct me if I'm wrong) is that we probably don't want to annotate every test in order to support GPUs. I'm thinking that this is just a temporary thing as you're playing around. If not then let's chat.

rjzamora · 2023-07-06T19:46:13Z

> My sense (correct me if I'm wrong) is that we probably don't want to annotate every test in order to support GPUs. I'm thinking that this is just a temporary thing as you're playing around. If not then let's chat.

Right, I'd rather not annotate everything, and I'm definitely still exploring here.

It isn't difficult to parameterize the df fixture to use both pandas and cudf (when available). However, it turns out that many tests fail for cudf. In some cases, the failures come from bugs (e.g. rapidsai/cudf#13666). In other cases, they come from limits in cudf's API coverage. For example, explode doesn't work because we assume that the backend supports a list of columns (and cudf doesn't), and idxmin/idxmax don't work because cudf only implements these operations for groupby.

At the moment, I'm thinking that I should just configure pdf/df with something like:

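import pytest

# cudf is optional: fall back to None so the skipif mark can detect its absence
try:
    import cudf
except ImportError:
    cudf = None

@pytest.fixture(
    params=[
        "pandas",
        pytest.param("cudf", marks=pytest.mark.skipif(cudf is None, reason="cudf not found.")),
    ]
)
def backend(request):
    yield request.param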

I'm thinking this would allow us to skip (maybe xfail?) specific tests for the "cudf" backend instead of annotating every test. That said, I'm still open to completely different approaches. For now, I'm mostly preoccupied with finding and fixing the actual bugs I run into.
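
One possible way to skip or xfail specific tests per backend without decorating each one is to keep the expectations in a single table. The sketch below is only an illustration under that assumption: EXPECTED_FAILURES, xfail_reason, and the test names are hypothetical, and the two entries simply restate the cudf gaps mentioned above.

```python
# Hypothetical table of tests expected to fail on a given backend.
# Test names are made up; the reasons restate the gaps described above.
EXPECTED_FAILURES = {
    "cudf": {
        "test_explode": "cudf explode does not accept a list of columns",
        "test_idxmin_idxmax": "cudf implements idxmin/idxmax only for groupby",
    },
}


def xfail_reason(backend, test_name):
    """Return why `test_name` is expected to fail on `backend`, or None."""
    return EXPECTED_FAILURES.get(backend, {}).get(test_name)


# Inside a test that receives the `backend` fixture, this could be used as:
#
#     reason = xfail_reason(backend, "test_explode")
#     if reason:
#         pytest.xfail(reason)
```

Keeping the table next to the fixture would make it easy to see, and later shrink, the set of known cudf failures.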

rjzamora · 2023-07-13T15:56:14Z

Update:

- This PR is currently blocked by "Enforce deterministic tokenize everywhere" #223 and "Enable deterministic tokenization for cudf objects in dask" rapidsai/cudf#13692 (both related to deterministic tokenization).
- To reduce the test complexity here, I am thinking that I may roll back the pandas/cudf parameterization and instead use an environment variable (e.g. TEST_DASK_EXPR_BACKEND) to define the backend library to import/use. This would allow one to run tests with the cudf backend by executing something like: TEST_DASK_EXPR_BACKEND=cudf py.test -v dask_expr

rjzamora · 2023-07-18T13:45:53Z

dask_expr/io/tests/test_io.py

+# Import backend DataFrame library to test
+BACKEND = config.get("dataframe.backend", "pandas")
+lib = importlib.import_module(BACKEND)


@phofl - Any opinions on using this approach as a "switch" to test the cudf backend?

Certainly interested in your thoughts as well @mrocklin

phofl · Jul 18, 2023

I think this is ok, although it will take me some time to get used to it 😂

Traveling today, will take a closer look at the whole PR when I am back.
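
The two added lines in test_io.py pick the DataFrame library by name when the test module is imported. A standalone sketch of the same pattern follows. It reads the name from the environment directly instead of from dask's config so that it runs without dask installed, and load_backend is a hypothetical helper name rather than anything in this PR.

```python
import importlib
import os


def load_backend(name=None):
    """Import and return the DataFrame library called `name`.

    When `name` is omitted, fall back to the DASK_DATAFRAME__BACKEND
    environment variable and then to "pandas".
    """
    name = name or os.environ.get("DASK_DATAFRAME__BACKEND", "pandas")
    try:
        return importlib.import_module(name)
    except ImportError as err:
        # Re-raise with a message that names the requested backend
        raise ImportError(f"DataFrame backend {name!r} is not installed") from err


# In a test module this would typically be used once at import time:
#
#     lib = load_backend()
#     pdf = lib.DataFrame({"x": [1, 2, 3]})
```

Because pandas and cudf both expose a DataFrame constructor under the same name, tests written against lib can stay unchanged when the backend is switched.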

dask_expr/tests/test_collection.py

phofl

Couple of comments

dask_expr/tests/test_collection.py

dask_expr/io/parquet.py

rjzamora · 2023-07-24T17:17:44Z

dask_expr/_expr.py

+        if self._required_attribute:
+            dep = next(iter(self.dependencies()))._meta
+            if not hasattr(dep, self._required_attribute):
+                # Raise a ValueError instead of AttributeError to
+                # avoid infinite recursion
+                raise ValueError(f"{dep} has no attribute {self._required_attribute}")
+
+    @property
+    def _required_attribute(self) -> str:
+        # Specify if the first `dependency` must support
+        # a specific attribute for valid behavior.
+        return None


@phofl - I ran into quite a few cases where the cudf backend hung (rather than failing quickly) because a cudf.DataFrame/Series did not support the same attribute as pd.DataFrame/Series (e.g. nbytes, align, etc.).

What do you think about doing something like this so that we can more quickly detect that the dependency is missing a necessary attribute?

phofl · Jul 25, 2023

Oh yes I like this.
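
To make the fail-fast idea concrete, here is a toy sketch of the check in isolation. The classes below are stand-ins written for illustration, not dask-expr's real Expr hierarchy, and check_required_attribute is a hypothetical method name. The choice of ValueError follows the comment in the diff: in Python, an AttributeError raised inside a property is treated as a failed lookup and routed to `__getattr__` when the class defines one, which is presumably how the recursion arises.

```python
class Expr:
    """Toy expression node; a stand-in, not dask-expr's real class."""

    # Name of an attribute the first dependency's metadata must have,
    # or None when there is no such requirement.
    _required_attribute = None

    def __init__(self, *deps):
        self._deps = list(deps)

    def dependencies(self):
        return self._deps

    def check_required_attribute(self):
        """Fail fast if the first dependency lacks the required attribute."""
        if self._required_attribute:
            dep = next(iter(self.dependencies()))._meta
            if not hasattr(dep, self._required_attribute):
                # ValueError rather than AttributeError, mirroring the diff
                raise ValueError(
                    f"{type(dep).__name__} has no attribute "
                    f"{self._required_attribute}"
                )


class Leaf(Expr):
    """Leaf node that carries its metadata object directly."""

    def __init__(self, meta):
        super().__init__()
        self._meta = meta


class NBytes(Expr):
    """Operation that only makes sense if the input exposes `nbytes`."""

    _required_attribute = "nbytes"


# memoryview has an `nbytes` attribute, bytes does not
NBytes(Leaf(memoryview(b"abc"))).check_required_attribute()  # passes silently

try:
    NBytes(Leaf(b"abc")).check_required_attribute()
except ValueError as err:
    message = str(err)
```

The built-in memoryview and bytes types are used here only because one has nbytes and the other does not, which mimics a backend object that lacks an attribute the pandas object provides.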

dask_expr/_reductions.py

phofl · 2023-07-25T18:22:12Z

thx!

rjzamora · 2023-07-25T18:24:22Z

woohoo

rjzamora added 2 commits July 5, 2023 13:37

basic cudf backend support

9875b27

formatting

3afa3b9

rjzamora added 5 commits July 6, 2023 12:56

partial revision

bead5cb

configure with backend fixture

4381738

revert xdf name back to df

3d229a4

Merge remote-tracking branch 'upstream/main' into cudf-backend

c947cfa

Merge remote-tracking branch 'upstream/main' into cudf-backend

f3c466c

rjzamora mentioned this pull request Jul 12, 2023

Enforce deterministic tokenize everywhere #223

Merged

Merge remote-tracking branch 'upstream/main' into cudf-backend

55d4a9b

rjzamora added 8 commits July 13, 2023 09:07

Merge remote-tracking branch 'upstream/main' into cudf-backend

489f6ef

fix pdf in test

06f3bf8

fix predicate-pushdown test

7d7a400

rely on DASK_DATAFRAME__BACKEND=cudf for now

934bd03

add _set_engine utility for parquet

65252c4

remove unnecesary engine arg

b2ebdf9

Merge remote-tracking branch 'upstream/main' into cudf-backend

d907084

fix test

8573040

rjzamora changed the title ~~Support and test the cudf backend~~ Support cudf as a DataFrame backend Jul 18, 2023

rjzamora marked this pull request as ready for review July 18, 2023 13:36

rjzamora commented Jul 18, 2023

dask_expr/tests/test_collection.py (outdated)

rjzamora added 4 commits July 18, 2023 09:26

revert pdf renaming

5c4c019

Merge remote-tracking branch 'upstream/main' into cudf-backend

07934a2

update test

6d308d7

address problems with cudf var/std behavior

b9624da

phofl reviewed Jul 24, 2023

dask_expr/tests/test_collection.py (outdated)

dask_expr/tests/test_collection.py (outdated)

dask_expr/io/parquet.py (outdated)

Merge remote-tracking branch 'upstream/main' into cudf-backend

1f203f9

rjzamora added 4 commits July 24, 2023 08:22

us decorators

f502717

rename _set_engine to _set_parquet_engine

fc88402

introduce _required_attribute

5d927c1

Merge remote-tracking branch 'upstream/main' into cudf-backend

c766ed8

rjzamora commented Jul 24, 2023

dask_expr/_reductions.py (outdated)

rjzamora added 2 commits July 24, 2023 12:19

Update dask_expr/_reductions.py

7f87c8f

Merge remote-tracking branch 'upstream/main' into cudf-backend

b8b3896

phofl approved these changes Jul 25, 2023

phofl merged commit 20cf274 into dask:main Jul 25, 2023

rjzamora deleted the cudf-backend branch July 25, 2023 18:24

rjzamora mentioned this pull request Jul 26, 2023

Generalize DataFrame backend for remaining tests #246

Merged
