Description
As discussed in an older Dask issue (dask/dask#6718), GPU objects do not produce a deterministic hash within `tokenize`. This is a problem for dask-expr, because we assume that an expression can be uniquely identified by the result of `tokenize(*operands)` (where `operands` may include GPU objects). While experimenting with this, I also discovered that some pyarrow objects (e.g. `pa.Table`) produce a non-deterministic token as well.
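To make the failure mode concrete, here is a minimal stdlib-only sketch of a tokenizer with a non-deterministic fallback (`naive_tokenize` and `FakeGpuSeries` are hypothetical stand-ins; dask's real implementation is more involved):

```python
import hashlib
import uuid

# Types we know how to normalize deterministically
DETERMINISTIC_TYPES = (int, float, str, bytes, bool, type(None))


def naive_tokenize(obj):
    """Toy model of tokenize: known types hash deterministically,
    unknown types fall back to a random token."""
    if isinstance(obj, DETERMINISTIC_TYPES):
        return hashlib.sha1(repr(obj).encode()).hexdigest()
    # Unknown object: no registered way to inspect its contents,
    # so fall back to a random token -- the non-determinism in question
    return uuid.uuid4().hex


class FakeGpuSeries:
    """Stand-in for a device-backed object with no registered normalizer."""

    def __init__(self, data):
        self.data = data


s = FakeGpuSeries([1, 2, 3])
naive_tokenize(s) == naive_tokenize(s)    # False: new token on every call
naive_tokenize(42) == naive_tokenize(42)  # True: deterministic path
```

An expression keyed by such a token would never match itself across graph rebuilds, which is why a deterministic normalization path matters.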
In the long term, this can be resolved upstream in cudf by implementing deterministic object hashing. In the meantime, we can also leverage the `normalize_token` dispatch to convert cudf objects into a smaller (but representative) tuple of pandas/numpy objects:
```python
from dask.base import normalize_token

import cudf


@normalize_token.register((cudf.DataFrame, cudf.Series, cudf.Index))
def _tokenize_cudf_object(obj):
    # Convert cudf objects to representative pd/np data
    return (
        type(obj),  # Type info
        (
            obj.to_frame()
            if not isinstance(obj, cudf.DataFrame)
            else obj
        ).dtypes,  # Schema info
        (
            obj.to_frame()
            if isinstance(obj, cudf.Index)
            else obj.reset_index()
        ).hash_values().to_numpy(),  # Hashed rows
    )
```

Although this kind of solution requires device-to-host data movement to convert the hashed-row series into numpy, the amount of data that needs to move will be minimal in performance-sensitive cases, because users should not be using an API like `from_pandas`.
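For reference, the dispatch pattern used in the cudf snippet can be sketched with stdlib-only stand-ins (`normalize`, `DeviceBuffer`, and `tokenize` here are hypothetical illustrations, not dask's actual API): converting an object to a canonical host-side representation before hashing is what makes the token deterministic.

```python
import hashlib
from functools import singledispatch


@singledispatch
def normalize(obj):
    # Default: rely on a stable repr (fine for builtins like int/str)
    return repr(obj)


class DeviceBuffer:
    """Stand-in for a device-backed object."""

    def __init__(self, data):
        self.data = list(data)


@normalize.register(DeviceBuffer)
def _normalize_device_buffer(obj):
    # Canonical host-side representation: type info plus contents,
    # analogous to the (type, dtypes, hashed rows) tuple for cudf
    return (type(obj).__name__, tuple(obj.data))


def tokenize(obj):
    # Hash the canonical representation, not the object itself
    return hashlib.sha1(repr(normalize(obj)).encode()).hexdigest()


a = DeviceBuffer([1, 2, 3])
b = DeviceBuffer([1, 2, 3])
tokenize(a) == tokenize(b)  # True: equal contents, equal tokens
```

The key design choice is the same as in the cudf case: the registered normalizer decides what "representative" data to expose, trading a small amount of host-side work for stable expression identity.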
**Overall question:** How do we want to deal with non-deterministic tokenization in general? That is, will we rely on `normalize_token` dispatching? If so, should we add the dispatch implementations in dask/dask, or should we provide a space for them in dask-expr?