[Data] Dataset.unique() raises error in case of any null values #42142

### What happened + What you expected to happen

I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling `Dataset.unique(colname)` on such data raises a `TypeError`, whose message differs depending on how the column dtype is specified. This behavior was surprising, since the equivalent operation on a `pandas.Series` works just fine, as does getting unique values via Python built-ins.

Here are the two versions of the `TypeError` I got, both seemingly raised from the same line of code:

```
File ~/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

TypeError: '<' not supported between instances of 'NoneType' and 'int'
```
and

```
File ~/.pyenv/versions/3.9.18/envs/test-env/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
    107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
    108 # Compute sorted indices of the samples. In np.lexsort last key is the
    109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
    111 # Sort each column by indices, and calculate q-ths quantile items.
    112 # Ignore the 1st item as it's not required for the boundary
    113 for k, v in sample_dict.items():

File <__array_function__ internals>:180, in lexsort(*args, **kwargs)

File missing.pyx:419, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous
```
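
Both tracebacks end in `np.lexsort` inside `SortTaskSpec.sample_boundaries`, which suggests (this is my reading of the traceback, not something I have confirmed in the Ray source) that `unique()` goes through a sort-based path. The pure-Python sketch below is not Ray code; it only illustrates why hash-based deduplication tolerates `None` while anything that needs ordering does not:

```python
items = [1, 2, 3, 2, 3, None]

# Hash-based deduplication never compares values with '<', so None is fine.
hash_unique = set(items)

# Sort-based deduplication needs an ordering, and None cannot be ordered
# against int, so sorting raises TypeError.
try:
    sorted(items)
    sort_error = None
except TypeError as exc:
    sort_error = exc

print(hash_unique)
print(repr(sort_error))
```

If that reading is right, any null in the sampled values would be enough to trigger the error, regardless of how many non-null values there are.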

### Versions / Dependencies

macOS 14.1
Python 3.9
ray == 2.9.0
pandas == 2.1.0

### Reproduction script

```python
import pandas as pd
import ray.data

items = [1, 2, 3, 2, 3, None]
# set(items) works fine, as expected
ds1 = ray.data.from_items(items)
ds1.unique("item")
# raises TypeError: '<' not supported between instances of 'NoneType' and 'int'

df = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
# df["col"].unique() works fine, as expected
ds2 = ray.data.from_pandas(df)
ds2.unique("col")
# raises TypeError: boolean value of NA is ambiguous
```

### Issue Severity

Medium: It is a significant difficulty but I can work around it.
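
For reference, here is a minimal sketch of one possible workaround: collect distinct values by hashing instead of sorting, skipping nulls along the way. The helper names `is_null` and `unique_ignoring_nulls` are hypothetical (they are not part of Ray), and the sketch only treats `None` and float `NaN` as null; other sentinels such as `pandas.NA` would need an extra check.

```python
import math
from typing import Any, Hashable, Iterable, Mapping, Set


def is_null(value: Any) -> bool:
    """Treat None and float NaN as null (hypothetical helper, not a Ray API)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def unique_ignoring_nulls(
    rows: Iterable[Mapping[str, Any]], column: str
) -> Set[Hashable]:
    """Collect distinct non-null values of `column` by hashing, not sorting."""
    return {row[column] for row in rows if not is_null(row[column])}


# Plain dict rows stand in for dataset rows in this sketch.
rows = [{"item": v} for v in [1, 2, 3, 2, 3, None, float("nan")]]
result = unique_ignoring_nulls(rows, "item")
print(result)
```

With a Ray dataset, the rows could presumably be supplied by `ds1.iter_rows()`, although I have not checked how nulls surface for every block type. This gathers all values in the calling process, so it only makes sense when the number of distinct values is small.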
