Skip to content

Enable deterministic tokenization for cudf objects in dask#13692

Closed
rjzamora wants to merge 8 commits intorapidsai:branch-23.08from
rjzamora:normalize-token-cudf
Closed

Enable deterministic tokenization for cudf objects in dask#13692
rjzamora wants to merge 8 commits intorapidsai:branch-23.08from
rjzamora:normalize-token-cudf

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Jul 12, 2023

Description

As discussed in dask/dask#6718, dask/dask-expr#80, and dask/dask#10398, cudf-backed objects do not produce a deterministic hash in dask.base.tokenize.

This PR registers normalize_token logic to manually hash rows and move cudf-based data to pandas before the final "tokenization" is performed. The result is that tokenize(<cudf-data>) is deterministic.

Important Detail:
Since this solution requires a device-to-host transfer, the specialized logic is only triggered when the "tokenize.ensure-deterministic" configuration option is explicitly set to True (in which case the default tokenization path would fail for cudf-backed data anyway).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. dask Dask issue non-breaking Non-breaking change helps: Dask labels Jul 12, 2023
@rjzamora rjzamora self-assigned this Jul 12, 2023
@rjzamora rjzamora requested a review from a team as a code owner July 12, 2023 17:50
@rjzamora rjzamora added the improvement Improvement / enhancement to an existing function label Jul 12, 2023
@rjzamora
Copy link
Member Author

cc @wence-

@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 13, 2023
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks OK, until we have some fast GPU hashing. But do we want determinism by default even if it's not requested in the dask config?

@normalize_token.register(cudf.core.frame.Frame)
def _normalize_cudf_frame(obj):
# Only move data to pandas if we "need" to
if config.get("tokenize.ensure-deterministic", False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure-deterministic is not the default in dask, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is False by default in dask.dask, but is now True in dask-expr. I'm slightly hesitant to add a new D->H transfer to dask-cudf workflows having from_pandas/cudf. Although I doubt the overhead will be significant, I saw this config as a pathway to punt on that decision for now.

@rjzamora
Copy link
Member Author

But do we want determinism by default even if it's not requested in the dask config?

Yes, we may want to use the hash_values approach more generally, but I'd prefer to make that change in a follow-up. After I've had time to test performance implications a bit.

@rjzamora
Copy link
Member Author

rjzamora commented Jul 13, 2023

but I'd prefer to make that change in a follow-up

On second thought, I am now thinking it may be best to avoid this "waiting period".

I'm also thinking this may be a great opportunity to solve the cudf vs dask_cudf dispatching problem (the problem being that dispatching logic for cudf is implemented in dask_cudf, and not all users understand if/when they need dask_cudf). That is, we can use the __dask_tokenize__ approach within cudf, instead of registering a normalize_token dispatch. The advantage to doing this is that we can also import dask_cudf (when available) within the __dask_tokenize__ definition to make sure we are registering the necessary dispatch functions for dask-dataframe/dask-expr to properly operate with a cudf backend.

Note that I am currently experimenting with this idea in https://github.com/rjzamora/cudf/tree/normalize-token-cudf-2

Any thoughts or opinions on this @wence- ?

EDIT: I submitted #13695 to demonstrate this alternative.

@rjzamora
Copy link
Member Author

Closing in favor of #13695

@rjzamora rjzamora closed this Jul 18, 2023
rapids-bot bot pushed a commit that referenced this pull request Jul 18, 2023
Possible alternative to #13692. See [this comment](#13692 (comment))

General motivation: We need deterministic tokenization in dask-expr. These changes provide what we need. We also need dask-cudf to be imported whenever we are using a `cudf` backend in `dask.dataframe` or `dask_expr`. These changes ensure that `dask_cudf` is imported (when available).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13695
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team dask Dask issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants