Description
As discussed in an older Dask issue (dask/dask#6718), GPU objects do not produce a deterministic hash within `tokenize`. This is a problem for dask-expr, because we assume that an expression can be uniquely identified by the result of `tokenize(*operands)` (where `operands` may include GPU objects). While experimenting with this, I also discovered that some pyarrow objects (e.g. `pa.Table`) produce a non-deterministic token as well.
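To make the failure mode concrete, here is a minimal stdlib-only sketch of a tokenizer with a non-deterministic fallback (`naive_tokenize` and `FakeGpuSeries` are hypothetical stand-ins; dask's real implementation is more involved):

```python
import hashlib
import uuid

# Types we know how to normalize deterministically
DETERMINISTIC_TYPES = (int, float, str, bytes, bool, type(None))


def naive_tokenize(obj):
    """Toy model of tokenize: known types hash deterministically,
    unknown types fall back to a random token."""
    if isinstance(obj, DETERMINISTIC_TYPES):
        return hashlib.sha1(repr(obj).encode()).hexdigest()
    # Unknown object: no registered way to inspect its contents,
    # so fall back to a random token -- the non-determinism in question
    return uuid.uuid4().hex


class FakeGpuSeries:
    """Stand-in for a device-backed object with no registered normalizer."""

    def __init__(self, data):
        self.data = data


s = FakeGpuSeries([1, 2, 3])
naive_tokenize(s) == naive_tokenize(s)    # False: new token on every call
naive_tokenize(42) == naive_tokenize(42)  # True: deterministic path
```

An expression keyed by such a token would never match itself across graph rebuilds, which is why a deterministic normalization path matters.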
In the long term, this can be resolved upstream in cudf by implementing deterministic object hashing. In the meantime, we can also leverage the `normalize_token` dispatch to convert cudf objects into a smaller (but representative) tuple of pandas/numpy objects:
```python
from dask.base import normalize_token

import cudf


@normalize_token.register((cudf.DataFrame, cudf.Series, cudf.Index))
def _tokenize_cudf_object(obj):
    # Convert cudf objects to representative pd/np data
    return (
        type(obj),  # Type info
        (
            obj.to_frame()
            if not isinstance(obj, cudf.DataFrame)
            else obj
        ).dtypes,  # Schema info
        (
            obj.to_frame()
            if isinstance(obj, cudf.Index)
            else obj.reset_index()
        ).hash_values().to_numpy(),  # Hashed rows
    )
```

Although this kind of solution requires device-to-host data movement to convert the hashed-row series into numpy, the amount of data that needs to move will be minimal in performance-sensitive cases, because users should not be using an API like `from_pandas`.
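For reference, the dispatch pattern used in the cudf snippet can be sketched with stdlib-only stand-ins (`normalize`, `DeviceBuffer`, and `tokenize` here are hypothetical illustrations, not dask's actual API): converting an object to a canonical host-side representation before hashing is what makes the token deterministic.

```python
import hashlib
from functools import singledispatch


@singledispatch
def normalize(obj):
    # Default: rely on a stable repr (fine for builtins like int/str)
    return repr(obj)


class DeviceBuffer:
    """Stand-in for a device-backed object."""

    def __init__(self, data):
        self.data = list(data)


@normalize.register(DeviceBuffer)
def _normalize_device_buffer(obj):
    # Canonical host-side representation: type info plus contents,
    # analogous to the (type, dtypes, hashed rows) tuple for cudf
    return (type(obj).__name__, tuple(obj.data))


def tokenize(obj):
    # Hash the canonical representation, not the object itself
    return hashlib.sha1(repr(normalize(obj)).encode()).hexdigest()


a = DeviceBuffer([1, 2, 3])
b = DeviceBuffer([1, 2, 3])
tokenize(a) == tokenize(b)  # True: equal contents, equal tokens
```

The key design choice is the same as in the cudf case: the registered normalizer decides what "representative" data to expose, trading a small amount of host-side work for stable expression identity.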
**Overall question:** How do we want to deal with non-deterministic tokenization in general? That is, will we rely on `normalize_token` dispatching? If so, should we add the dispatch implementations in dask/dask, or should we provide a space for them in dask-expr?