[POC] Add general class-dispatching mechanism #321
Looking only briefly at this, I'll admit that I'm not smart enough to understand the proposed design by looking at the code. I think it might be helpful to write down, at a high level, a bit more about how this would work. I'd aim for something that you'd send to a new developer to help explain the system. (If it's sufficiently complex that this would be hard, then it's a sign that it might not be a good design.) In general, my bias is to be very wary of adding a new dimension of dispatching. Currently I'm trying to think about how to extend dask-expr to arrays, and I'd be quite concerned if adding a system like this made it harder to think about expressions, such that arrays became harder to achieve.
Thanks for taking a look @mrocklin - I added a quick overview, but may be able to revise/simplify later. Overall, I'm aiming for something that is very light-weight, and mostly "invisible" within dask-expr itself. I think we are on the same page in terms of wanting to avoid complexity. I spent a lot of time struggling with this, because it is becoming apparent to me that the current collection API will need a lot of ugly compatibility code if we don't break out cudf-based collections into a distinct API (like dask-cudf). However, even if we do something like the …
I am a bit concerned about all these dimensions along which you could end up registering dispatching mechanisms, but I don't see a better way of achieving what you are aiming for.
Right - I think it is worth mentioning that it may be possible to use a simple …
While pushing on dask-expr + cudf support, I noticed that we are still pretty far from the API coverage of dask-cudf.
Dask-cudf currently relies on several forms of "dispatching" to smooth over differences between cudf and pandas:
1. `DataFrame`/`Series`/`Index` classes

I was hoping to avoid (or at least delay) the need for anything new in dask-expr, and so we have almost entirely leaned on the existing dispatching code. However, in order to support differences in sorting, groupby aggregation, and shuffling behavior, we now need to either (a) put in a lot of work to align the existing algorithms, or (b) add something similar to (1) as well.
Since dask-expr also uses an expansive `Expr` class structure under the hood, I am also thinking that we need general class-level dispatching instead of simple `DataFrame`/`Series`/`Index` subclassing. For example, there are probably several areas where we just want to "lower" an expression differently for cudf, and don't really need to redefine the collection-level API. For cases like this, we can just register a custom class dispatch for the problematic expression, and override the `_lower` method only.
This POC illustrates one possible solution for a general dispatch system in which any user or external library can effectively override a collection or expression class for a specific metadata type. Note that I am trying hard to design something as simple as possible, so suggestions are very welcome.
How it's used
From the user perspective, the proposed design is probably as simple as it gets: you are allowed to register a custom dispatch for any `FrameBase` or `Expr` class by decorating your code with `@<class name>.register_dispatch(<meta type>)`. For example, if I want to define my own cudf-specific `DataFrame` collection, I can decorate my custom collection class in this way.
If the special behavior you want is more specific to a single `Expr` subclass, you can register a custom dispatch for that class in the same way. For example, if you just want cudf to do something different for `groupby.agg`, you can register a custom class for the corresponding expression only.
Important Note: For typical dask-expr development with a pandas backend, nothing should change as a result of this PR.
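To make the two registration patterns above concrete, here is a minimal, self-contained sketch. It does not import dask-expr or cudf: `Dispatchable`, `FakeCudfFrame`, `FakePandasFrame`, `FromFrame`, `CudfDataFrame`, and `CudfGroupbyAgg` are all hypothetical stand-ins, and the toy `register_dispatch` only does an exact-type lookup. Only the decorator shape `@<class name>.register_dispatch(<meta type>)` is taken from the description.

```python
# All names here are hypothetical stand-ins, not the real dask-expr or cudf API.

class FakeCudfFrame:
    """Plays the role of cudf.DataFrame as a metadata type."""


class FakePandasFrame:
    """Plays the role of pandas.DataFrame as a metadata type."""


class Dispatchable:
    """Toy mixin: every subclass gets its own {meta type: custom class} table."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._dispatch = {}

    @classmethod
    def register_dispatch(cls, meta_type):
        def decorator(custom_cls):
            cls._dispatch[meta_type] = custom_cls
            return custom_cls
        return decorator

    def __new__(cls, *args, **kwargs):
        # Look at the metadata of the first argument, if it has any.
        meta = getattr(args[0], "_meta", None) if args else None
        return object.__new__(cls._dispatch.get(type(meta), cls))


class Expr(Dispatchable):
    def __init__(self, *operands):
        self.operands = operands

    @property
    def _meta(self):
        # Assumption for this sketch: metadata is inherited from the first operand.
        return self.operands[0]._meta


class FromFrame(Expr):
    """Leaf expression whose metadata is the wrapped frame itself."""

    @property
    def _meta(self):
        return self.operands[0]


class GroupbyAgg(Expr):
    def _lower(self):
        return "pandas-style groupby aggregation"


class DataFrame(Dispatchable):
    def __init__(self, expr):
        self.expr = expr


# Pattern 1: a cudf-specific collection class.
@DataFrame.register_dispatch(FakeCudfFrame)
class CudfDataFrame(DataFrame):
    pass


# Pattern 2: only one expression is lowered differently for cudf.
@GroupbyAgg.register_dispatch(FakeCudfFrame)
class CudfGroupbyAgg(GroupbyAgg):
    def _lower(self):
        return "cudf-specific groupby aggregation"


gpu = FromFrame(FakeCudfFrame())
cpu = FromFrame(FakePandasFrame())
print(type(DataFrame(gpu)).__name__)  # CudfDataFrame
print(type(DataFrame(cpu)).__name__)  # DataFrame
print(GroupbyAgg(gpu)._lower())       # cudf-specific groupby aggregation
print(GroupbyAgg(cpu)._lower())       # pandas-style groupby aggregation
```

Code that constructs `DataFrame(...)` or `GroupbyAgg(...)` never mentions the cudf classes, which is the sense in which the mechanism is meant to stay mostly invisible for pandas-backed development.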
How it works
The new system requires very few changes to `FrameBase` and `Expr`, because (1) it leverages the existing type-dispatching logic in `dask.utils.Dispatch`, and (2) it does not require any additional code to register the default `FrameBase`/`Expr` classes that are already defined in dask-expr.
Registration of dispatch classes works in the same way that dispatch functions are registered in dask/dask. When `base_cls.register_dispatch(meta_type, custom_cls)` is used to register a custom class for a (`base_class`, `meta_type`) combination, we essentially add the custom class to a `Dict[base_class, dask.utils.Dispatch]` structure (where the `dask.utils.Dispatch` element handles the type-based dispatching).
At runtime, class dispatching happens within `cls.__new__`, which is called before object initialization. Since we can inspect the initialization arguments within `cls.__new__`, we can also check whether the first argument (`arg`) is an `Expr` instance, and whether we have a custom class registered for the combination of `cls` and `arg._meta`. If we do have something registered, we simply use the custom class to create our object instead of `cls`. This means that our custom class must accept the same `__init__` arguments as `cls`, but attributes like `_meta`/`_lower`/etc. can all be different.