[Data] Allow Unique and ApproximateTopK to encode list values #58538
kyuds wants to merge 5 commits into ray-project:master

Conversation
Code Review
The pull request introduces the ability for Unique and ApproximateTopK aggregations to handle list values, either by encoding individual list elements or by treating entire lists as single objects. This is a valuable enhancement for data processing flexibility. However, there appears to be a logical inconsistency in the Unique aggregation's implementation when encode_lists is False. The current code converts each element within the lists to a string and then finds unique values among these stringified elements, which effectively flattens the lists. This contradicts the stated intent in the docstring to "encode whole lists (i.e., the entire list is considered as a single object)". The test case test_unique_for_list_elements also seems to expect this behavior of treating whole lists as unique objects, but the implementation for encode_lists=False does not achieve this. Additionally, one test case test_approximate_topk_list_encode is incorrectly placed under TestApproximateTopK but uses the Unique aggregation.
ds = ray.data.from_items(data)
result = ds.aggregate(Unique(on="id", encode_lists=False))

assert sorted(result["unique(id)"])[0] == "['a', 'a', 'a', 'b']"
This assertion expects the string representation of the entire list ['a', 'a', 'a', 'b'] to be a unique item. However, due to the bug in Unique.aggregate_block when encode_lists=False, the implementation currently flattens the list and finds unique elements.
If the bug in Unique.aggregate_block is fixed to correctly encode whole lists as single objects, this assertion should pass. If the bug is not fixed, this test will fail or pass incorrectly depending on the exact behavior of pc.unique on the flattened elements.
this is actually as written down in the documentation
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
]
col = pa.array(str_list, type=pa.string())

return pc.unique(col).to_pylist()
Bug: Null values lost in aggregate processing.
When ignore_nulls=False, the aggregate_block method incorrectly drops null values because pc.unique() excludes nulls by default. This affects both the encode_lists=True path (after pc.list_flatten()) and the encode_lists=False path (after converting to string array). The method needs to handle nulls explicitly when ignore_nulls=False to preserve them in the unique values output.
this is simply not true (that pc.unique() excludes null values). Refer to custom_agg test cases below, and also documentation: https://arrow.apache.org/docs/python/generated/pyarrow.compute.unique.html
I think it might surprise users if unique converts lists to strings.
>>> ray.data.from_items([{"id": [1, 2, 3]}]).unique("id")
[[1, 2, 3]] # Expected
>>> ray.data.from_items([{"id": [1, 2, 3]}]).unique("id")
['[1, 2, 3]'] # With current implementation

My understanding is that we're converting the list to a string because pc.unique doesn't work with list types. Are there any other approaches we considered?
Also, is it important for Unique to support encode_lists in isolation, or is it just so that we can use it for other preprocessors like LabelEncoder? If it's the latter, one option might be to not add encode_lists to Unique, and convert lists to strings in LabelEncoder (and other relevant implementations).
# Possible LabelEncoder implementation pseudocode
ds.map_batches(convert_lists_to_strings).unique(list_key)
I had a couple of followups here. Imo, it would make sense to make encode_lists always True and ideally not expose it to users, especially since stringifying the result seems rather unintuitive.
Also unique seems to fail this case as well:
pyarrow.Table
col1: large_list<item: large_list<item: int64>>
  child 0, item: large_list<item: int64>
    child 0, item: int64
----
col1: [[[[1,2,3]],[[1,2,3]]]]
One option is to use https://arrow.apache.org/docs/python/compute.html#user-defined-functions (Custom UDF kernel) to address these corner cases.
Addressing comments:
- Originally, I used str mainly because ApproximateTopK uses str (because for the encoders, it is useful to actually have the same datatype outputs for both aggregators). HOWEVER, looking through pyarrow, it's possible to pickle dump/undump, so I think we can preserve original types.
- Mainly supporting encode_lists for Unique because of the encoders. If we want to actually move the list encoding part to the encoders, that's also fine with me; these are just design considerations.
- Actually need it to work on whole lists for encoders. I think the pickle dump/undump for serialization should help in this scenario.
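The "pickle dump/undump" idea mentioned above can be sketched like this (a hypothetical helper, not the actual Ray implementation): serialize each value so that unhashable types like lists become hashable byte keys, then return the deserialized originals, preserving types:

```python
import pickle

def unique_preserving_types(values):
    """Deduplicate values of any type by pickling them into hashable keys."""
    seen = {}
    for v in values:
        key = pickle.dumps(v)
        if key not in seen:
            seen[key] = v  # keep the original, un-stringified value
    return list(seen.values())

rows = [[1, 2, 3], [1, 2, 3], [4], None]
print(unique_preserving_types(rows))  # [[1, 2, 3], [4], None]
```

Unlike the str-based workaround, callers get back real lists rather than their string representations.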
if self._encode_lists and isinstance(py_value, list):
    for item in py_value:
        if item is not None:
            sketch.update(str(item))
I think it's okay to use str here since that's consistent with the existing implementation
…flag (#58916)

## Description
Basically the same idea as #58659. So `Unique` aggregator uses `pyarrow.compute.unique` function internally. This doesn't work with non-hashable types like lists. Similar to what I did for `ApproximateTopK`, we now use pickle to serialize and deserialize elements.

Other improvements:
- `ignore_nulls` flag didn't work at all. This flag now properly works.
- Had to force `ignore_nulls=False` for datasets `unique` api for backwards compatibility (we set `ignore_nulls` to `True` by default, so behavior for datasets `unique` api will change now that `ignore_nulls` actually works).

## Related issues
This PR replaces #58538

## Additional information
[Design doc on my notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link)

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
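The `ignore_nulls` semantics described in the commit message can be sketched alongside the pickle-based dedup (assumed helper name, not Ray's actual code): the flag decides whether null rows survive into the unique output.

```python
import pickle

def unique_rows(values, ignore_nulls=True):
    """Pickle-based dedup with an explicit ignore_nulls switch."""
    seen = {}
    for v in values:
        if v is None and ignore_nulls:
            continue  # drop nulls entirely when the flag is set
        seen.setdefault(pickle.dumps(v), v)
    return list(seen.values())

data = [[1, 2], None, [1, 2]]
print(unique_rows(data, ignore_nulls=True))   # [[1, 2]]
print(unique_rows(data, ignore_nulls=False))  # [[1, 2], None]
```

This is why the datasets `unique` API pins `ignore_nulls=False`: with a now-working flag, the `True` default would silently start dropping nulls that older releases returned.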
Description
Related to #58450 - this is the first out of three PRs.
Previous behavior:
- `unique` method on lists
- `str`

Notes on current behavior:
- ApproximateTopK uses `frequent_strings_sketch`, so it force-converts everything to string (this is true for original implementation anyways), and Unique will error because pyarrow doesn't allow heterogeneous list element types.

Related issues
N/A
Additional information
N/A