Add ``normalize_token`` dispatch for cudf objects by rjzamora · Pull Request #10398 · dask/dask

rjzamora · 2023-07-05T17:56:42Z

As discussed in dask/dask-expr#80, tokenize does not return a deterministic result for cudf objects. For dask-expr to "work" with cudf-backed data, tokenizing the same cudf.DataFrame object must always return the same token.
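To make the requirement concrete, here is a minimal stdlib-only sketch of the idea; it does not use dask or cudf. `DeviceFrame`, `normalize`, and `tokenize` are hypothetical stand-ins: `functools.singledispatch` plays the role of a `normalize_token`-style dispatch, and the random UUID fallback for unregistered types is an assumption of this sketch, used only to model a non-deterministic token.

```python
import hashlib
import uuid
from functools import singledispatch


class DeviceFrame:
    """Hypothetical stand-in for a cudf.DataFrame (not a real cudf class)."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}


@singledispatch
def normalize(obj):
    # Assumed fallback for unregistered types: a fresh random value per call,
    # so two calls on the same object never agree.
    return uuid.uuid4().hex


def tokenize(obj):
    # Hash the normalized form into a fixed-length token.
    return hashlib.sha256(repr(normalize(obj)).encode()).hexdigest()


frame = DeviceFrame({"A": [0.5, 1.5], "B": [2.0, 3.0]})

# Before registration the fallback is used, so tokens are not reproducible.
unstable_a, unstable_b = tokenize(frame), tokenize(frame)


@normalize.register(DeviceFrame)
def _normalize_device_frame(obj):
    # Deterministic normal form: column names and values in column order.
    return tuple((name, tuple(values)) for name, values in obj.columns.items())


# After registration the same object always yields the same token.
stable_a, stable_b = tokenize(frame), tokenize(frame)

print(unstable_a == unstable_b)
print(stable_a == stable_b)
```

The registered normalizer here walks every value on the host, which is exactly the cost the PR tries to avoid for device data; the sketch only illustrates why a registered, deterministic normal form is needed at all.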

This PR currently leverages the hash_values method to avoid moving all of the cudf data to pandas for tokenization. I cannot think of a "proper" way to define normalize_token such that we don't need to move data from device to host. One possibility is to tokenize str(obj._data._data), which returns something like:

{'A': <cudf.core.column.numerical.NumericalColumn object at 0x7f12880c17c0>
[
  0.8953784288214444,
  0.08755770229686265,
  0.34997287209485844,
  -0.3932070666454323,
  -0.48936633276402525,
  -0.27516736628435173,
  0.345745552676493,
  -0.8052518333730919,
  -0.724573970398181,
  1.417117733900153,
  ...
  1.964413734663649,
  1.4461948823430957,
  0.3597774372003223,
  -1.8124063537988735,
  1.3471233020721483,
  0.06658665344959998,
  -2.407065477016253,
  -0.08984276519450277,
  0.0563293241722997,
  -0.3651402969564298
]
dtype: float64, 'B': <cudf.core.column.numerical.NumericalColumn object at 0x7f12880c1740>
[
  -0.6808073065312824,
  0.7509298536770876,
  1.0138677714674298,
  0.9697560553365036,
  -0.6810526127554865,
  2.2993594506803046,
  1.0055211903308015,
  1.4056699179705736,
  0.27700903648754815,
  -0.6962352402999742,
  ...
  1.504129139928046,
  -1.07522259421438,
  -0.3427721142143722,
  0.15845171832180857,
  -0.15321560790625618,
  0.13699003853682115,
  -0.0021965876556498997,
  -1.0362483192703498,
  0.8643568908318128,
  -0.002322298244040977
]
dtype: float64}

This approach doesn't account for every row/value, but does account for the schema and data-buffer locations (maybe this is good enough?).
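The trade-off between the two options can be sketched without a GPU. The following is an illustration under stated assumptions, not the PR's implementation: `Column` is a hypothetical class whose repr mimics the truncated output above (object address plus head and tail values), `repr_token` hashes that string, and `row_hash_token` is a host-side analogue of hashing every row, assuming `hash_values` yields one hash per row.

```python
import hashlib


class Column:
    """Hypothetical column whose repr mimics the truncated output shown above."""

    def __init__(self, values):
        self.values = list(values)

    def __repr__(self):
        # Show only the first and last three values, plus the object address.
        if len(self.values) > 6:
            shown = self.values[:3] + ["..."] + self.values[-3:]
        else:
            shown = self.values
        return f"<Column object at {hex(id(self))}>\n{shown}"


def _digest(text):
    return hashlib.sha256(text.encode()).hexdigest()


def repr_token(frame):
    # Token from the string form: covers column names, addresses, head/tail.
    return _digest(str(frame))


def row_hash_token(frame):
    # Token from one hash per row, so every value contributes.
    names = list(frame)
    rows = zip(*(frame[name].values for name in names))
    row_hashes = [_digest(repr(row)) for row in rows]
    return _digest(repr((names, row_hashes)))


base = {"A": Column(range(10)), "B": Column(range(10, 20))}
copy = {"A": Column(range(10)), "B": Column(range(10, 20))}

repr_before, rows_before = repr_token(base), row_hash_token(base)

# Equal contents held in different objects: only the row-hash token matches.
copy_same_repr = repr_token(copy) == repr_before
copy_same_rows = row_hash_token(copy) == rows_before

# In-place change to a middle row: only the row-hash token notices.
base["A"].values[5] = -1
edit_same_repr = repr_token(base) == repr_before
edit_same_rows = row_hash_token(base) == rows_before

print(copy_same_repr, copy_same_rows)
print(edit_same_repr, edit_same_rows)
```

In this sketch the string-based token is cheap and stable for the same unmodified object, but it treats equal data at a different address as different and misses edits hidden by truncation. The row-hash token has neither problem, at the cost of touching every row.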

Perhaps @galipremsagar has some ideas :)

rjzamora · 2023-07-12T17:56:23Z

I'm going to close this in favor of rapidsai/cudf#13692, since I expect that we will probably want to iterate on the exact logic in cudf/dask_cudf.

add normalize_token dispatch for cudf objects

dcfed3c

rjzamora added the dataframe, bug, and gpu labels Jul 5, 2023

github-actions bot added the dispatch label Jul 5, 2023

Merge remote-tracking branch 'upstream/main' into normalize-token-cudf

99b24ca

rjzamora mentioned this pull request Jul 5, 2023

Support cudf as a DataFrame backend dask/dask-expr#212

Merged

Merge remote-tracking branch 'upstream/main' into normalize-token-cudf

020ee1d

rjzamora mentioned this pull request Jul 12, 2023

Enable deterministic tokenization for cudf objects in dask rapidsai/cudf#13692

Closed

3 tasks

rjzamora closed this Jul 12, 2023

rjzamora wants to merge 3 commits into dask:main from rjzamora:normalize-token-cudf
