
Add normalize_token dispatch for cudf objects#10398

Closed
rjzamora wants to merge 3 commits into dask:main from rjzamora:normalize-token-cudf


rjzamora (Member) commented Jul 5, 2023

As discussed in dask/dask-expr#80, `tokenize` does not return a deterministic result for cudf objects. For dask-expr to "work" with cudf-backed data, the same `cudf.DataFrame` object must always produce the same token.
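To make the requirement concrete, here is a minimal sketch of type-based normalization. It uses only the standard library: `functools.singledispatch` stands in for dask's `normalize_token` dispatch, and `FakeDeviceFrame`, `normalize`, and `token` are hypothetical names invented for this illustration. The random-UUID fallback is an assumption about how an unregistered type ends up with a non-deterministic token; it is not a copy of dask's code.

```python
import hashlib
import uuid
from functools import singledispatch


class FakeDeviceFrame:
    """Hypothetical stand-in for a cudf.DataFrame: named columns of values."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}


@singledispatch
def normalize(obj):
    # Fallback for types with no registered handler (assumed failure mode):
    # a fresh random value on every call, so tokens are not reproducible.
    return uuid.uuid4().hex


@normalize.register(FakeDeviceFrame)
def _normalize_frame(obj):
    # Content-based normalization: depends only on column names and values,
    # in column order, never on object identity.
    return tuple((name, tuple(vals)) for name, vals in obj.columns.items())


def token(obj):
    # Hash the normalized form into a short, stable string.
    return hashlib.md5(repr(normalize(obj)).encode()).hexdigest()


class Unregistered:
    """A type with no handler; it falls through to the random fallback."""


frame_a = FakeDeviceFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
frame_b = FakeDeviceFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
other = Unregistered()

registered_is_stable = token(frame_a) == token(frame_a) == token(frame_b)
unregistered_is_stable = token(other) == token(other)
```

With a handler registered, equal data yields equal tokens across calls and across objects; without one, even the same object gets a different token each time.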

This PR currently uses the `hash_values` method to avoid moving all of the cudf data to pandas for tokenization. I cannot think of a "proper" way to define `normalize_token` that avoids moving any data from device to host. One possibility is to tokenize `str(obj._data._data)`, which returns something like:

{'A': <cudf.core.column.numerical.NumericalColumn object at 0x7f12880c17c0>
[
  0.8953784288214444,
  0.08755770229686265,
  0.34997287209485844,
  -0.3932070666454323,
  -0.48936633276402525,
  -0.27516736628435173,
  0.345745552676493,
  -0.8052518333730919,
  -0.724573970398181,
  1.417117733900153,
  ...
  1.964413734663649,
  1.4461948823430957,
  0.3597774372003223,
  -1.8124063537988735,
  1.3471233020721483,
  0.06658665344959998,
  -2.407065477016253,
  -0.08984276519450277,
  0.0563293241722997,
  -0.3651402969564298
]
dtype: float64, 'B': <cudf.core.column.numerical.NumericalColumn object at 0x7f12880c1740>
[
  -0.6808073065312824,
  0.7509298536770876,
  1.0138677714674298,
  0.9697560553365036,
  -0.6810526127554865,
  2.2993594506803046,
  1.0055211903308015,
  1.4056699179705736,
  0.27700903648754815,
  -0.6962352402999742,
  ...
  1.504129139928046,
  -1.07522259421438,
  -0.3427721142143722,
  0.15845171832180857,
  -0.15321560790625618,
  0.13699003853682115,
  -0.0021965876556498997,
  -1.0362483192703498,
  0.8643568908318128,
  -0.002322298244040977
]
dtype: float64}

The `str`-based approach doesn't account for every row/value, since the repr elides the middle of each column, but it does account for the schema and data-buffer locations (maybe this is good enough?).
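The trade-off can be illustrated with a small standard-library sketch. Everything here is hypothetical: `truncated_repr` only imitates a repr that shows the first and last few values, and `full_token` stands in for any scheme that covers every row (such as combining per-row hashes). Unlike the real output above, the sketch omits object addresses, so it isolates only the effect of the elided rows.

```python
import hashlib


def truncated_repr(values, edge=10):
    """Hypothetical repr showing only the first and last `edge` values."""
    if len(values) <= 2 * edge:
        shown = [repr(v) for v in values]
    else:
        head = [repr(v) for v in values[:edge]]
        tail = [repr(v) for v in values[-edge:]]
        shown = head + ["..."] + tail
    return "[" + ", ".join(shown) + "]"


def repr_token(values):
    # Token derived from the truncated string only.
    return hashlib.md5(truncated_repr(values).encode()).hexdigest()


def full_token(values):
    # Token derived from every value, with a separator between entries.
    digest = hashlib.md5()
    for v in values:
        digest.update(repr(v).encode())
        digest.update(b"\x00")
    return digest.hexdigest()


original = [float(i) for i in range(100)]
modified = list(original)
modified[50] = -1.0  # change a value in the elided middle section

same_repr_token = repr_token(original) == repr_token(modified)
same_full_token = full_token(original) == full_token(modified)
```

Here the change at row 50 is invisible to the repr-based token but is caught by the token that visits every value. This is the gap the "maybe this is good enough?" question is weighing.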

Perhaps @galipremsagar has some ideas :)

rjzamora added the dataframe, bug (Something is broken), and gpu labels Jul 5, 2023
github-actions bot added the dispatch (Related to `Dispatch` extension objects) label Jul 5, 2023
rjzamora (Member, Author) commented

I'm going to close this in favor of rapidsai/cudf#13692, since I expect we will want to iterate on the exact logic in cudf/dask_cudf.

rjzamora closed this Jul 12, 2023