Skip to content

[Data] Allow Unique and ApproximateTopK to encode list values#58538

Closed
kyuds wants to merge 5 commits intoray-project:masterfrom
kyuds:aggregate-encode-lists
Closed

[Data] Allow Unique and ApproximateTopK to encode list values#58538
kyuds wants to merge 5 commits intoray-project:masterfrom
kyuds:aggregate-encode-lists

Conversation

@kyuds
Copy link
Copy Markdown
Member

@kyuds kyuds commented Nov 11, 2025

Description

Related to #58450 - this is the first out of three PRs.

Previous behavior:

  • Unique: when list was passed, the program errors because pyarrow compute cannot run the unique method on lists.
  • ApproximateTopK: when list was passed, it was cast to string via str

Notes on current behavior:

  • Encode lists will not support heterogenous list element types (eg: lists with numbers and strings). ApproximateTopK uses frequent_strings_sketch so it force-converts everything to string (this is true for original implementation anyways), and Unique will error because pyarrow doesn't allow heterogenous list element types.

Related issues

N/A

Additional information

N/A

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@kyuds kyuds requested a review from a team as a code owner November 11, 2025 13:59
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces the ability for Unique and ApproximateTopK aggregations to handle list values, either by encoding individual list elements or by treating entire lists as single objects. This is a valuable enhancement for data processing flexibility. However, there appears to be a logical inconsistency in the Unique aggregation's implementation when encode_lists is False. The current code converts each element within the lists to a string and then finds unique values among these stringified elements, which effectively flattens the lists. This contradicts the stated intent in the docstring to "encode whole lists (i.e., the entire list is considered as a single object)". The test case test_unique_for_list_elements also seems to expect this behavior of treating whole lists as unique objects, but the implementation for encode_lists=False does not achieve this. Additionally, one test case test_approximate_topk_list_encode is incorrectly placed under TestApproximateTopK but uses the Unique aggregation.

ds = ray.data.from_items(data)
result = ds.aggregate(Unique(on="id", encode_lists=False))

assert sorted(result["unique(id)"])[0] == "['a', 'a', 'a', 'b']"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This assertion expects the string representation of the entire list ['a', 'a', 'a', 'b'] to be a unique item. However, due to the bug in Unique.aggregate_block when encode_lists=False, the implementation currently flattens the list and finds unique elements.

If the bug in Unique.aggregate_block is fixed to correctly encode whole lists as single objects, this assertion should pass. If the bug is not fixed, this test will fail or pass incorrectly depending on the exact behavior of pc.unique on the flattened elements.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually as written down in the documentation

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 11, 2025
Signed-off-by: kyuds <kyuseung1016@gmail.com>
]
col = pa.array(str_list, type=pa.string())

return pc.unique(col).to_pylist()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Null values lost in aggregate processing.

When ignore_nulls=False, the aggregate_block method incorrectly drops null values because pc.unique() excludes nulls by default. This affects both the encode_lists=True path (after pc.list_flatten()) and the encode_lists=False path (after converting to string array). The method needs to handle nulls explicitly when ignore_nulls=False to preserve them in the unique values output.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is simply not true (that pc.unique() excludes null values). Refer to custom_agg test cases below, and also documentation: https://arrow.apache.org/docs/python/generated/pyarrow.compute.unique.html

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might surprise users if unique converts lists to strings.

>>> ray.data.from_items([{"id": [1, 2, 3]}]).unique("id")
[[1, 2, 3]]  # Expected                               
>>> ray.data.from_items([{"id": [1, 2, 3]}]).unique("id")
['[1, 2, 3]']  # With current implementation                                                                                                                                                           

My understanding is that we're converting the list to a string because pc.unique doesn't work with list types. Are there any other approaches we considered?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it important for Unique to support encode_lists in isolation, or is it just so that we can use it for other preprocessors like LabelEncoder? If it's the latter, one option might be to not add encode_lists to Unique, and convert lists to strings in LabelEncoder (and other relevant implementations)

# Possible LabelEncoder implementation pseudocode
ds.map_batches(convert_lists_to_strings).unique(list_key)

Copy link
Copy Markdown
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a couple of followups here. Imo, it would make sense to make encode_lists to always be True and ideally not expose it to users, especially since stringifying the result seems rather unintuitive.

Also unique seems to fail this case as well:

pyarrow.Table
col1: large_list<item: large_list<item: int64>>
  child 0, item: large_list<item: int64>
      child 0, item: int64
----
col1: [[[[1,2,3]],[[1,2,3]]]]

One option is to use https://arrow.apache.org/docs/python/compute.html#user-defined-functions (Custom UDF kernel) to address these corner cases.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing comments:

  1. Originally, I used str mainly because ApproximateTopK uses str (because for the encoders, it is useful to actually have same datatype outputs for both aggregators). HOWEVER, looking through pyarrow, it’s possible to pickle dump/undump, so I think we can preserve original types.
  2. Mainly supporting encode_lists for Unique because of the encoders. If we want to actually move the list encoding part to the encoders, thats also fine with me, these are just design considerations.
  3. Actually need it to work on whole lists for encoders. I think the pickle dump/undump for serialization should help in this scenario.

]
col = pa.array(str_list, type=pa.string())

return pc.unique(col).to_pylist()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might surprise users if unique converts lists to strings.

>>> ray.data.from_items([{"id": [1, 2, 3]}]).unique("id")
[[1, 2, 3]]  # Expected                               
>>> ray.data.from_items([{"id": [1, 2, 3]}]).unique("id")
['[1, 2, 3]']  # With current implementation                                                                                                                                                           

My understanding is that we're converting the list to a string because pc.unique doesn't work with list types. Are there any other approaches we considered?

]
col = pa.array(str_list, type=pa.string())

return pc.unique(col).to_pylist()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it important for Unique to support encode_lists in isolation, or is it just so that we can use it for other preprocessors like LabelEncoder? If it's the latter, one option might be to not add encode_lists to Unique, and convert lists to strings in LabelEncoder (and other relevant implementations)

# Possible LabelEncoder implementation pseudocode
ds.map_batches(convert_lists_to_strings).unique(list_key)

if self._encode_lists and isinstance(py_value, list):
for item in py_value:
if item is not None:
sketch.update(str(item))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay to use str here since that's consistent with the existing implementation

@iamjustinhsu iamjustinhsu added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Nov 19, 2025
@kyuds
Copy link
Copy Markdown
Member Author

kyuds commented Nov 22, 2025

this PR is succeeded with #58659 and #58916

@kyuds kyuds closed this Nov 22, 2025
@kyuds kyuds deleted the aggregate-encode-lists branch November 22, 2025 10:00
richardliaw pushed a commit that referenced this pull request Dec 9, 2025
…flag (#58916)

## Description
Basically the same idea as #58659 

So `Unique` aggregator uses `pyarrow.compute.unique` function
internally. This doesn't work with non-hashable types like lists.
Similar to what I did for `ApproximateTopK`, we now use pickle to
serialize and deserialize elements.

Other improvements:
- `ignore_nulls` flag didn't work at all. This flag now properly works
- Had to force `ignore_nulls=False` for datasets `unique` api for
backwards compatibility (we set `ignore_nulls` to `True` by default, so
behavior for datasets `unique` api will change now that `ignore_nulls`
actually works)

## Related issues
This PR replaces #58538

## Additional information
[Design doc on my
notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link)

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…flag (ray-project#58916)

## Description
Basically the same idea as ray-project#58659

So `Unique` aggregator uses `pyarrow.compute.unique` function
internally. This doesn't work with non-hashable types like lists.
Similar to what I did for `ApproximateTopK`, we now use pickle to
serialize and deserialize elements.

Other improvements:
- `ignore_nulls` flag didn't work at all. This flag now properly works
- Had to force `ignore_nulls=False` for datasets `unique` api for
backwards compatibility (we set `ignore_nulls` to `True` by default, so
behavior for datasets `unique` api will change now that `ignore_nulls`
actually works)

## Related issues
This PR replaces ray-project#58538

## Additional information
[Design doc on my
notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link)

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants