
[POC] Add general class-dispatching mechanism#321

Closed
rjzamora wants to merge 8 commits into dask:main from rjzamora:class-dispatch

Conversation

@rjzamora (Member) commented Oct 5, 2023

While pushing on dask-expr + cudf support, I noticed that we are still pretty far from the API coverage of dask-cudf.

Dask-cudf currently relies on several forms of "dispatching" to smooth over differences between cudf and pandas:

  1. Sub-classing of the DataFrame/Series/Index classes
  2. Type-based dispatching of compute functions
  3. Backend-config dispatching
  4. Array protocol

I was hoping to avoid (or at least delay) the need for anything new in dask-expr, and so we have almost entirely leaned on the existing dispatching code. However, in order to support differences in sorting, groupby aggregation, and shuffling behavior, we now need to either (a) put in a lot of work to align the existing algorithms, or (b) add something similar to (1) as well.

Since dask-expr also uses an expansive Expr class structure under the hood, I am also thinking that we need general class-level dispatching instead of simple DataFrame/Series/Index subclassing. For example, there are probably several areas where we just want to "lower" an expression differently for cudf, and don't really need to redefine the collection-level API. For cases like this, we can just register a custom class dispatch for the problematic expression, and override the _lower method only.

This POC illustrates one possible solution for a general dispatch system in which any user or external library can effectively override a collection or expression class for a specific metadata type. Note that I am trying hard to design something as simple as possible, so suggestions are very welcome.

How it's used

From the user perspective, the proposed design is probably as simple as it gets: You are allowed to register a custom dispatch for any FrameBase or Expr class by decorating your code with @<class name>.register_dispatch(<meta type>). For example, if I want to define my own cudf-specific DataFrame collection, I can just do something like:

import cudf
from dask_expr import DataFrame

@DataFrame.register_dispatch(cudf.DataFrame)
class CudfDataFrame(DataFrame):
    ...  # whatever methods I want to override

If the special behavior you want is more specific to a single Expr subclass, you can register a custom dispatch for that class in the same way. For example, if you just want cudf to do something different for groupby.agg, you can do something like:

import cudf
from dask_expr import Expr
from dask_expr._groupby import GroupbyAggregation

@GroupbyAggregation.register_dispatch(cudf.DataFrame)
class CudfGroupbyAggregation(Expr):
    ...  # my custom groupby-agg logic

Important Note: For typical dask-expr development with a pandas backend, nothing should change as a result of this PR.

How it works

The new system requires very few changes to FrameBase and Expr, because (1) it leverages the existing type-dispatching logic in dask.utils.Dispatch, and (2) it does not require any additional code to register the default FrameBase/Expr classes that are already defined in dask-expr.

Registration of dispatch classes works the same way dispatch functions are registered in dask/dask. When base_cls.register_dispatch(meta_type, custom_cls) is used to register a custom class for a (base_class, meta_type) combination, we essentially add the custom class to a Dict[base_class, dask.utils.Dispatch] structure (where the dask.utils.Dispatch element handles the type-based dispatching).
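To make the registry structure concrete, here is a minimal sketch of how such a registration decorator could look. This is illustrative only, not the PR's actual code: a plain nested dict stands in for dask.utils.Dispatch so the example is self-contained, and all names (_registry, DispatchBase, lookup_dispatch) are hypothetical.

```python
_registry = {}  # Dict[base_class, Dict[meta_type, custom_class]]


class DispatchBase:
    @classmethod
    def register_dispatch(cls, meta_type):
        """Decorator registering a custom class for the (cls, meta_type) pair."""
        def decorator(custom_cls):
            _registry.setdefault(cls, {})[meta_type] = custom_cls
            return custom_cls
        return decorator

    @classmethod
    def lookup_dispatch(cls, meta):
        """Return the class registered for type(meta), or cls itself."""
        table = _registry.get(cls, {})
        # Walk the meta type's MRO so subclasses of a registered type also
        # match, mirroring how dask.utils.Dispatch resolves types
        for typ in type(meta).__mro__:
            if typ in table:
                return table[typ]
        return cls
```

In this sketch the inner dict plays the role that a dask.utils.Dispatch instance plays in the real design: mapping a metadata type to the class that should be instantiated for it.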

At runtime, class dispatching happens within cls.__new__, which is called before object initialization. Since we can inspect the initialization arguments within cls.__new__, we can also check if the first argument (arg) is an Expr instance, and if we have a custom class registered for the combination of cls and arg._meta. If we do have something registered, we simply use the custom class to create our object instead of cls. This means that our custom class must accept the same __init__ arguments as cls, but attributes like _meta/_lower/etc can all be different.
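The runtime behavior described above can be sketched in a few lines. Again, this is a self-contained stand-in rather than the PR's implementation: `_dispatch_lut` substitutes for the Dict[base_class, dask.utils.Dispatch] structure, and the Source/Projection/CudfMeta classes are invented for illustration.

```python
_dispatch_lut = {}  # (base_class, meta_type) -> custom_class


class Expr:
    _meta = None

    def __new__(cls, *args, **kwargs):
        # Inspect the initialization arguments before __init__ runs: if the
        # first argument is an Expr, check whether a custom class is
        # registered for the combination of cls and type(arg._meta)
        if args and isinstance(args[0], Expr):
            custom = _dispatch_lut.get((cls, type(args[0]._meta)))
            if custom is not None and custom is not cls:
                # The custom class must accept the same __init__ arguments
                return super().__new__(custom)
        return super().__new__(cls)


class CudfMeta:
    pass  # pretend cudf.DataFrame metadata


class Source(Expr):
    def __init__(self, meta):
        self._meta = meta


class Projection(Expr):
    def __init__(self, frame):
        self.frame = frame
        self._meta = frame._meta


class CudfProjection(Projection):
    pass  # would override _lower, etc.


# Register the override; any Projection built on a cudf-backed
# expression now comes back as a CudfProjection
_dispatch_lut[(Projection, CudfMeta)] = CudfProjection
```

Because `__new__` returns an instance of a subclass of the requested class, Python still calls `__init__` with the original arguments, which is why the custom class must accept the same signature.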

@mrocklin (Member) commented Oct 5, 2023

Looking only briefly at this, I'll admit that I'm not smart enough to understand the proposed design by looking at the code. I think it might be helpful to write down a bit more at a high level how this would work. I'd aim for something that you'd send to a new developer to help explain the system. (If it's sufficiently complex that this would be hard, then it's a sign that it might not be a good design).

In general, my bias is to be very wary of adding a new dimension of dispatching. Currently I'm trying to think about how to extend dask-expr to arrays, and I'd be quite concerned if adding a system like this made it harder to think about expressions such that arrays became harder to achieve.

@rjzamora (Member, Author) commented Oct 6, 2023

I think it might be helpful to write down a bit more at a high level how this would work.

Thanks for taking a look @mrocklin - I added a quick overview, but may be able to revise/simplify later. Overall, I'm aiming for something that is very light-weight, and mostly "invisible" within dask-expr itself.

I think we are on the same page in terms of wanting to avoid complexity. I spent a lot of time struggling with this, because it is becoming apparent to me that the current collection API will need a lot of ugly compatibility code if we don't break out cudf-based collections into a distinct API (like dask-cudf). However, even if we do something like the new_dd_object trick, I will still probably end up spending a lot of time duplicating Expr code for the cudf-specific API. I suppose I'm really looking for a solution where RAPIDS can be as surgical/efficient as possible about overriding default dask-expr behavior, without requiring anything heavy in dask-expr itself.

@phofl (Collaborator) commented Oct 11, 2023

I am a bit concerned about all these dimensions along which dispatching mechanisms could end up being registered, but I don't see a better way of achieving what you are aiming for.

@rjzamora (Member, Author) commented

but I don't see a better way of achieving what you are aiming for.

Right - I think it is worth mentioning that it may be possible to use a simple *.substitute operation to get what we need, but only if expressions were refactored so that _meta/divisions/npartitions were not calculated until absolutely necessary (e.g. after the expression was "lowered"). Querying one of these properties from the collection API would need to trigger a lower pass. I do think we should do this for divisions/npartitions, but I have doubts that this makes sense for _meta.
