Backend library dispatching for IO in Dask-Array and Dask-DataFrame#9475
Backend library dispatching for IO in Dask-Array and Dask-DataFrame#9475
Conversation
jrbourbeau
left a comment
There was a problem hiding this comment.
Apologies, I wasn't able to get to a detailed review on this today. I have a few non-blocking comments on the .rst doc added here but, as they're non-blocking, I'd prefer to just submit a follow-up PR for folks to review. @rjzamora @ian-r-rose, unless I'm needed for something, I'll defer to you two as to when this PR is good to merge.
I think the "public" component of this PR is ready to go, and the internal pieces can be cleaned up if/when needed. However, I'm a bit biased, so I'll gladly defer to the opinions of @ian-r-rose and @wence- :) |
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
|
It looks like we're seeing conflicts with #9283, which was just merged, and adds an entrypoint section. If you merge |
Ah-ha! I was getting very confused. Makes sense. |
|
Note: This branch includes the necessary changes to merge #9038 with this PR. |
Nice! That was quick |
ian-r-rose
left a comment
There was a problem hiding this comment.
I'm happy with where this stands now, thanks for all your work on this @rjzamora!
|
Thanks for the tremendous help here @ian-r-rose @wence- @jrbourbeau ! Please do ping me if this causes any unexpected problems :) |
This PR depends on dask/dask#9475 (**Now Merged**) After dask#9475, external libraries are now able to implement (and expose) their own `DataFrameBackendEntrypoint` definitions to specify custom creation functions for DataFrame collections. This PR introduces the `CudfBackendEntrypoint` class to create `dask_cudf.DataFrame` collections using the `dask.dataframe` API. By installing `dask_cudf` with this entrypoint definition in place, you get the following behavior in `dask.dataframe`: ```python import dask.dataframe as dd import dask # Tell Dask that you want to create DataFrame collections # with the "cudf" backend (for supported creation functions). # This can also be used in a context, or set in a yaml file dask.config.set({"dataframe.backend": "cudf"}) ddf = dd.from_dict({"a": range(10)}, npartitions=2) type(ddf) # dask_cudf.core.DataFrame ``` Note that the code snippet above does not require an explicit import of `cudf` or `dask_cudf`. The following creation functions will support backend dispatching after dask#9475: - `from_dict` - `read_paquet` - `read_json` - `read_orc` - `read_csv` - `read_hdf` See also: dask/design-docs#1 Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11920
This PR serves as a reference implementation for (what is expected to be) "Dask Design Proposal 1": dask/design-docs#1
Required cudf branch for Dask-DataFrame dispatching to cudf: https://github.com/rjzamora/cudf/tree/backend-class