[Data] Dataset.unique() raises error in case of any null values #42142

@bdewilde

What happened + What you expected to happen

I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling Dataset.unique(colname) on such data raises a TypeError, with differing specifics depending on how the column dtype is specified. This behavior was surprising since the equivalent operation on a pandas.Series works just fine, as does getting unique values via Python built-ins.

Here are the two versions of the TypeError I got, seemingly raised from the same line of code:

File ~/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

TypeError: '<' not supported between instances of 'NoneType' and 'int'

and

File ~/.pyenv/versions/3.9.18/envs/test-env/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

File missing.pyx:419, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous
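The first error can be illustrated without Ray. Both tracebacks show Dataset.unique() reaching np.lexsort in SortTaskSpec.sample_boundaries, so the values get ordered, not just hashed. The plain-Python sketch below shows the same contrast as an analogy; it does not reproduce Ray's internals. Hash-based deduplication accepts None, while anything that sorts the values has to compare None with an int.

```python
items = [1, 2, 3, 2, 3, None]

# Hash-based deduplication only needs hashing and equality, so None is fine.
unique_by_hash = set(items)

# Sort-based processing must order None against int, which Python 3 rejects.
try:
    sorted(items)
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)

print(unique_by_hash)
print(sort_error)
```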

Versions / Dependencies

macOS 14.1
Python 3.9
ray == 2.9.0
pandas == 2.1.0

Reproduction script

import pandas as pd
import ray.data

items = [1, 2, 3, 2, 3, None]
# set(items) works fine, as expected
ds1 = ray.data.from_items(items)
ds1.unique("item")
# raises TypeError: '<' not supported between instances of 'NoneType' and 'int'

df = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
# df["col"].unique() works fine, as expected
ds2 = ray.data.from_pandas(df)
ds2.unique("col")
# raises TypeError: boolean value of NA is ambiguous

Issue Severity

Medium: It is a significant difficulty but I can work around it.
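One possible direction for a workaround, sketched in plain Python rather than against Ray's API: give nulls their own position in the ordering so they are never compared with non-null values. The names null_safe_key and sorted_unique are hypothetical helpers for illustration and are not part of Ray. The sketch handles None only, not pandas.NA.

```python
def null_safe_key(value):
    """Sort key that places None after every non-null value.

    The first tuple element separates nulls from non-nulls, so the second
    element is only ever compared between values of the same kind.
    """
    return (value is None, 0 if value is None else value)


def sorted_unique(values):
    """Return the distinct values in sorted order, with None last."""
    return sorted(set(values), key=null_safe_key)


result = sorted_unique([1, 2, 3, 2, 3, None])
print(result)  # [1, 2, 3, None]
```

Placing nulls last is an arbitrary choice here. Using `value is not None` as the first tuple element would place them first instead.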

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
good-first-issue: Great starter issue for someone just starting to contribute to Ray