[Data] Allow Unique and ApproximateTopK to encode list values #58538
kyuds wants to merge 5 commits into ray-project:master

Conversation
Code Review
The pull request introduces the ability for Unique and ApproximateTopK aggregations to handle list values, either by encoding individual list elements or by treating entire lists as single objects. This is a valuable enhancement for data processing flexibility. However, there appears to be a logical inconsistency in the Unique aggregation's implementation when encode_lists is False. The current code converts each element within the lists to a string and then finds unique values among these stringified elements, which effectively flattens the lists. This contradicts the stated intent in the docstring to "encode whole lists (i.e., the entire list is considered as a single object)". The test case test_unique_for_list_elements also seems to expect this behavior of treating whole lists as unique objects, but the implementation for encode_lists=False does not achieve this. Additionally, one test case test_approximate_topk_list_encode is incorrectly placed under TestApproximateTopK but uses the Unique aggregation.
ds = ray.data.from_items(data)
result = ds.aggregate(Unique(on="id", encode_lists=False))

assert sorted(result["unique(id)"])[0] == "['a', 'a', 'a', 'b']"
This assertion expects the string representation of the entire list ['a', 'a', 'a', 'b'] to be a unique item. However, due to the bug in Unique.aggregate_block when encode_lists=False, the implementation currently flattens the list and finds unique elements.
If the bug in Unique.aggregate_block is fixed to correctly encode whole lists as single objects, this assertion should pass. If the bug is not fixed, this test will fail or pass incorrectly depending on the exact behavior of pc.unique on the flattened elements.
this is actually as written down in the documentation
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
]
col = pa.array(str_list, type=pa.string())

return pc.unique(col).to_pylist()
Bug: Null values lost in aggregate processing.
When ignore_nulls=False, the aggregate_block method incorrectly drops null values because pc.unique() excludes nulls by default. This affects both the encode_lists=True path (after pc.list_flatten()) and the encode_lists=False path (after converting to string array). The method needs to handle nulls explicitly when ignore_nulls=False to preserve them in the unique values output.
this is simply not true (that pc.unique() excludes null values). Refer to custom_agg test cases below, and also documentation: https://arrow.apache.org/docs/python/generated/pyarrow.compute.unique.html
I think it might surprise users if unique converts lists to strings.
>>> ray.data.from_items([{"id": [1, 2, 3]}]).unique("id")
[[1, 2, 3]] # Expected
>>> ray.data.from_items([{"id": [1, 2, 3]}]).unique("id")
['[1, 2, 3]'] # With current implementation

My understanding is that we're converting the list to a string because pc.unique doesn't work with list types. Are there any other approaches we considered?
Also, is it important for Unique to support encode_lists in isolation, or is it just so that we can use it for other preprocessors like LabelEncoder? If it's the latter, one option might be to not add encode_lists to Unique, and convert lists to strings in LabelEncoder (and other relevant implementations).
# Possible LabelEncoder implementation pseudocode
ds.map_batches(convert_lists_to_strings).unique(list_key)
I had a couple of followups here. Imo, it would make sense to make encode_lists always True and ideally not expose it to users, especially since stringifying the result seems rather unintuitive.
Also unique seems to fail this case as well:
pyarrow.Table
col1: large_list<item: large_list<item: int64>>
  child 0, item: large_list<item: int64>
    child 0, item: int64
----
col1: [[[[1,2,3]],[[1,2,3]]]]
One option is to use https://arrow.apache.org/docs/python/compute.html#user-defined-functions (Custom UDF kernel) to address these corner cases.
Addressing comments:
- Originally, I used str mainly because ApproximateTopK uses str (because for the encoders, it is useful to actually have the same datatype outputs for both aggregators). HOWEVER, looking through pyarrow, it's possible to pickle dump/undump, so I think we can preserve original types.
- Mainly supporting encode_lists for Unique because of the encoders. If we want to actually move the list encoding part to the encoders, that's also fine with me; these are just design considerations.
- Actually need it to work on whole lists for encoders. I think the pickle dump/undump for serialization should help in this scenario.
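The "pickle dump/undump" idea mentioned above can be sketched like this (a hypothetical helper, not the actual Ray implementation): serialize each value so that unhashable types like lists become hashable byte keys, then return the deserialized originals, preserving types:

```python
import pickle

def unique_preserving_types(values):
    """Deduplicate values of any type by pickling them into hashable keys."""
    seen = {}
    for v in values:
        key = pickle.dumps(v)
        if key not in seen:
            seen[key] = v  # keep the original, un-stringified value
    return list(seen.values())

rows = [[1, 2, 3], [1, 2, 3], [4], None]
print(unique_preserving_types(rows))  # [[1, 2, 3], [4], None]
```

Unlike the str-based workaround, callers get back real lists rather than their string representations.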
if self._encode_lists and isinstance(py_value, list):
    for item in py_value:
        if item is not None:
            sketch.update(str(item))
I think it's okay to use str here since that's consistent with the existing implementation
…flag (#58916)

## Description
Basically the same idea as #58659. So `Unique` aggregator uses `pyarrow.compute.unique` function internally. This doesn't work with non-hashable types like lists. Similar to what I did for `ApproximateTopK`, we now use pickle to serialize and deserialize elements.

Other improvements:
- `ignore_nulls` flag didn't work at all. This flag now properly works.
- Had to force `ignore_nulls=False` for datasets `unique` api for backwards compatibility (we set `ignore_nulls` to `True` by default, so behavior for datasets `unique` api will change now that `ignore_nulls` actually works).

## Related issues
This PR replaces #58538

## Additional information
[Design doc on my notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link)

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
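The `ignore_nulls` semantics described in the commit message can be sketched alongside the pickle-based dedup (assumed helper name, not Ray's actual code): the flag decides whether null rows survive into the unique output.

```python
import pickle

def unique_rows(values, ignore_nulls=True):
    """Pickle-based dedup with an explicit ignore_nulls switch."""
    seen = {}
    for v in values:
        if v is None and ignore_nulls:
            continue  # drop nulls entirely when the flag is set
        seen.setdefault(pickle.dumps(v), v)
    return list(seen.values())

data = [[1, 2], None, [1, 2]]
print(unique_rows(data, ignore_nulls=True))   # [[1, 2]]
print(unique_rows(data, ignore_nulls=False))  # [[1, 2], None]
```

This is why the datasets `unique` API pins `ignore_nulls=False`: with a now-working flag, the `True` default would silently start dropping nulls that older releases returned.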
Description
Related to #58450 - this is the first out of three PRs.
Previous behavior:
- `unique` method on lists
- `str`

Notes on current behavior:
- ApproximateTopK uses `frequent_strings_sketch`, so it force-converts everything to string (this is true for original implementation anyways), and Unique will error because pyarrow doesn't allow heterogeneous list element types.

Related issues
N/A
Additional information
N/A