Enable deterministic tokenization for cudf objects in dask#13692
Enable deterministic tokenization for cudf objects in dask#13692rjzamora wants to merge 8 commits intorapidsai:branch-23.08from
Conversation
|
cc @wence- |
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
wence-
left a comment
There was a problem hiding this comment.
I think this looks OK, until we have some fast GPU hashing. But do we want determinism by default even if it's not requested in the dask config?
| @normalize_token.register(cudf.core.frame.Frame) | ||
| def _normalize_cudf_frame(obj): | ||
| # Only move data to pandas if we "need" to | ||
| if config.get("tokenize.ensure-deterministic", False): |
There was a problem hiding this comment.
ensure-deterministic is not the default in dask, right?
There was a problem hiding this comment.
Right, this is False by default in dask.dask, but is now True in dask-expr. I'm slightly hesitant to add a new D->H transfer to dask-cudf workflows having from_pandas/cudf. Although I doubt the overhead will be significant, I saw this config as a pathway to punt on that decision for now.
… into normalize-token-cudf
Yes, we may want to use the |
On second thought, I am now thinking it may be best to avoid this "waiting period". I'm also thinking this may be a great opportunity to solve the Note that I am currently experimenting with this idea in https://github.com/rjzamora/cudf/tree/normalize-token-cudf-2 Any thoughts or opinions on this @wence- ? EDIT: I submitted #13695 to demonstrate this alternative. |
|
Closing in favor of #13695 |
Possible alternative to #13692. See [this comment](#13692 (comment)) General motivation: We need deterministic tokenization in dask-expr. These changes provide what we need. We also need dask-cudf to be imported whenever we are using a `cudf` backend in `dask.dataframe` or `dask_expr`. These changes ensure that `dask_cudf` is imported (when available). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #13695
Description
As discussed in dask/dask#6718, dask/dask-expr#80, and dask/dask#10398, cudf-backed objects do not produce a deterministic hash in
dask.base.tokenize.This PR registers
normalize_tokenlogic to manually hash rows and move cudf-based data to pandas before the final "tokenization" is performed. The result is thattokenize(<cudf-data>)is deterministic.Important Detail:
Since this solution requires a device-to-host transfer, the specialized logic is only triggered when the
"tokenize.ensure-deterministic"configuration option is explicitly set toTrue(in which case the default tokenization path would fail for cudf-backed data anyway).Checklist