[Bug] Non-deterministic tokenize for cudf and pyarrow objects #80

@rjzamora

As discussed in an older Dask issue (dask/dask#6718), GPU objects do not produce a deterministic hash within tokenize. This is a problem for dask-expr, because we assume that an expression can be uniquely identified by the result of tokenize(*operands), where operands may include GPU objects. While experimenting with this, I discovered that some pyarrow objects (e.g. pa.Table) also produce a non-deterministic token.

In the long term, this can be resolved upstream in cudf by implementing deterministic object hashing. In the meantime, we can leverage the normalize_token dispatch to convert cudf objects into a smaller (but representative) tuple of pandas/numpy objects:

from dask.base import normalize_token
import cudf

@normalize_token.register((cudf.DataFrame, cudf.Series, cudf.Index))
def _tokenize_cudf_object(obj):
    # Convert cudf objects to representative pd/np data
    return (
        type(obj),  # Type info
        (
            obj.to_frame()
            if not isinstance(obj, cudf.DataFrame)
            else obj
        ).dtypes,  # Schema info
        (
            obj.to_frame()
            if isinstance(obj, cudf.Index)
            else obj.reset_index()
        ).hash_values().to_numpy(),  # Hashed rows
    )

Although this kind of solution requires device -> host data movement to convert the hashed-row series into numpy, only one hash value per row has to move. That amount of data should be minimal in performance-sensitive cases, because users should not be creating collections from large in-memory objects with an API like from_pandas in those cases.
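To make the role of the dispatch concrete without needing cudf or a GPU, here is a minimal stdlib-only sketch. It is not Dask's implementation: `normalize`, `tokenize`, and `DeviceBuffer` are hypothetical stand-ins, and the random fallback is an assumption chosen only to mimic a non-deterministic token. It shows that two objects with equal contents get different tokens until a content-based normalizer is registered for their type.

```python
import hashlib
import uuid
from functools import singledispatch


class DeviceBuffer:
    """Hypothetical stand-in for a GPU object with no stable hash."""

    def __init__(self, values):
        self.values = list(values)


@singledispatch
def normalize(obj):
    # Fallback for unknown types: a random value, so every call differs.
    # (Assumed behavior, chosen to mimic a non-deterministic token.)
    return uuid.uuid4().hex


@normalize.register(int)
@normalize.register(str)
def _normalize_scalar(obj):
    # Scalars are identified by their type name and repr.
    return (type(obj).__name__, repr(obj))


def tokenize(*operands):
    # Hash the normalized form of every operand into a single token.
    payload = repr(tuple(normalize(op) for op in operands))
    return hashlib.sha256(payload.encode()).hexdigest()


a, b = DeviceBuffer([1, 2, 3]), DeviceBuffer([1, 2, 3])

# Without a registered normalizer, equal contents give different tokens.
unstable = tokenize("sum", a) != tokenize("sum", b)


@normalize.register(DeviceBuffer)
def _normalize_device_buffer(obj):
    # Reduce the object to representative, content-based data.
    return (type(obj).__name__, tuple(obj.values))


# With the normalizer, the token depends only on the contents.
stable = tokenize("sum", a) == tokenize("sum", b)
changed = tokenize("sum", a) != tokenize("sum", DeviceBuffer([1, 2, 4]))
print(unstable, stable, changed)
```

The cudf registration above follows the same shape, with the row hashes playing the part of `tuple(obj.values)` so that only a compact summary of the data has to leave the device.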

Overall Question: "How do we want to deal with non-deterministic tokenization in general?" That is, will we rely on normalize_token dispatching? If so, should we add the dispatch implementations in dask/dask, or should we provide a space in dask-expr?
