Closed
Labels
P1, bug, data, good-first-issue
Description
What happened + What you expected to happen
I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling Dataset.unique(colname) on such data raises a TypeError, with differing specifics depending on how the column dtype is specified. This behavior was surprising since the equivalent operation on a pandas.Series works just fine, as does getting unique values via Python built-ins.
Here are the two versions of the TypeError I got, both seemingly raised from the same line of code:
File ~/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
108 # Compute sorted indices of the samples. In np.lexsort last key is the
109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
111 # Sort each column by indices, and calculate q-ths quantile items.
112 # Ignore the 1st item as it's not required for the boundary
113 for k, v in sample_dict.items():
File <__array_function__ internals>:180, in lexsort(*args, **kwargs)
TypeError: '<' not supported between instances of 'NoneType' and 'int'
and
File ~/.pyenv/versions/3.9.18/envs/test-env/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
108 # Compute sorted indices of the samples. In np.lexsort last key is the
109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
111 # Sort each column by indices, and calculate q-ths quantile items.
112 # Ignore the 1st item as it's not required for the boundary
113 for k, v in sample_dict.items():
File <__array_function__ internals>:180, in lexsort(*args, **kwargs)
File missing.pyx:419, in pandas._libs.missing.NAType.__bool__()
TypeError: boolean value of NA is ambiguous
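Both tracebacks end inside a sort step (np.lexsort), which needs an ordering between every pair of values it compares, whereas set() only needs hashing and equality. The following is a minimal pure-Python sketch of that difference, not Ray's actual implementation; the helper names unique_via_hash and unique_via_sort are hypothetical and exist only for illustration.

```python
def unique_via_hash(values):
    """Deduplicate with a set; only hashing and equality are required."""
    return set(values)


def unique_via_sort(values):
    """Deduplicate by sorting first, then dropping adjacent repeats."""
    ordered = sorted(values)  # requires `<` between the values it compares
    result = []
    for value in ordered:
        if not result or result[-1] != value:
            result.append(value)
    return result


items = [1, 2, 3, 2, 3, None]

# Hash-based deduplication accepts None alongside ints.
print(unique_via_hash(items))

# Sort-based deduplication fails because None and int have no ordering.
try:
    unique_via_sort(items)
except TypeError as exc:
    print(f"sort-based path failed: {exc}")
```

This is consistent with set(items) working while a sort-backed code path raises on the same input.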
Versions / Dependencies
macOS 14.1
Python 3.9
ray == 2.9.0
pandas == 2.1.0
Reproduction script
import pandas as pd
import ray.data
items = [1, 2, 3, 2, 3, None]
# set(items) works fine, as expected
ds1 = ray.data.from_items(items)
ds1.unique("item")
# raises TypeError: '<' not supported between instances of 'NoneType' and 'int'
df = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
# df["col"].unique() works fine, as expected
ds2 = ray.data.from_pandas(df)
ds2.unique("col")
# raises TypeError: boolean value of NA is ambiguous
Issue Severity
Medium: It is a significant difficulty but I can work around it.
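One general way to make a sort tolerate nulls is a sort key that orders None after every real value. The sketch below is plain Python over an in-memory list, under the assumption that nulls are represented as None (it does not handle pandas.NA) and that the non-null values are mutually comparable. It is neither Ray's fix nor a Ray API; null_last_key and unique_sorted_nulls_last are hypothetical names.

```python
def null_last_key(value):
    """Sort key placing None after all non-null values.

    The first tuple element decides the null/non-null ordering; the second
    is a placeholder (0) for None so that None is never compared with `<`.
    """
    return (value is None, 0 if value is None else value)


def unique_sorted_nulls_last(values):
    """Return the distinct values in ascending order, with None (if any) last."""
    return sorted(set(values), key=null_last_key)


print(unique_sorted_nulls_last([1, 2, 3, 2, 3, None]))  # [1, 2, 3, None]
```

Treating null as a distinct value that sorts last mirrors the pandas and set() behavior described above, where the null simply appears once in the result.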