[Data] Support Non-String Items for `ApproximateTopK` Aggregator by kyuds · Pull Request #58659 · ray-project/ray

kyuds · 2025-11-15T14:02:54Z

Description

Internally, the `ApproximateTopK` aggregator uses `frequent_strings_sketch` to implement efficient top-k calculations. As the name `frequent_strings_sketch` hints, the current implementation casts all data to strings before feeding it into the sketch, so the output items are also strings.

Therefore, when we have numeric data, for instance, we would get:

[{"id": "1", "count": 5} ... ]  # notice 1 is not an integer, but string

instead of

[{"id": 1, "count": 5} ... ]

which would be expected.

Other types, such as lists and tuples, are also cast to strings, making it hard for users to recover the original values.

This PR (after offline discussion with some Ray Data team members) uses the `pickle` library to serialize each item: the hex string of the pickle dump is what gets inserted into `frequent_strings_sketch`, and the items are unpickled when the results are read back.
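The round trip described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `encode_item` and `decode_item` are hypothetical helper names, and a `collections.Counter` stands in for `frequent_strings_sketch` so the example runs without the datasketches library. It also assumes that equal values pickle to identical bytes, which holds for the simple values used here.

```python
import pickle
from collections import Counter


def encode_item(item):
    """Serialize any picklable value to a hex string (hypothetical helper)."""
    return pickle.dumps(item).hex()


def decode_item(hex_str):
    """Recover the original value from its hex-encoded pickle."""
    return pickle.loads(bytes.fromhex(hex_str))


# Counter is a stand-in for the string-only sketch: it only ever sees str keys.
values = [1, 1, 1, (2, 3), (2, 3), "a"]
counts = Counter(encode_item(v) for v in values)

# Decode the keys on the way out so callers get the original types back.
top = [
    {"id": decode_item(key), "count": int(count)}
    for key, count in counts.most_common(2)
]
```

With this encoding the integer `1` comes back as an integer and the tuple `(2, 3)` comes back as a tuple, rather than as their string representations.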

As a further improvement, this PR also adds an `encode_lists` flag to encode individual list values. This will be useful for our encoders (specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future.
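As a rough illustration of what a flag like `encode_lists` could mean, the sketch below counts either whole lists or their individual elements. The helpers `iter_items` and `top_items` are hypothetical, a `Counter` again stands in for the sketch, and the semantics shown are an assumption based on the sentence above rather than on the PR's code.

```python
import pickle
from collections import Counter


def iter_items(column, encode_lists=False):
    """Yield the values to count (hypothetical helper).

    Assumed semantics: with encode_lists=True each list element is counted
    on its own; otherwise a list is counted as one item.
    """
    for value in column:
        if encode_lists and isinstance(value, list):
            yield from value
        else:
            yield value


def top_items(column, k, encode_lists=False):
    """Return the k most frequent values with their counts."""
    # Pickle + hex gives every value, hashable or not, a string key.
    counts = Counter(
        pickle.dumps(v).hex() for v in iter_items(column, encode_lists)
    )
    return [
        (pickle.loads(bytes.fromhex(key)), n)
        for key, n in counts.most_common(k)
    ]


column = [["a", "b"], ["a", "b"], ["a"]]
whole = top_items(column, 1)                           # lists as single items
per_element = top_items(column, 1, encode_lists=True)  # elements individually
```

Counting per element is the behavior an encoder over multi-valued columns would likely want, since it needs frequencies of the individual categories rather than of each distinct list.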

Related issues

N/A

Additional information

N/A

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

kyuds · 2025-11-15T14:03:07Z

PTAL @owenowenisme @bveeramani

gemini-code-assist

Code Review

This pull request is a great improvement, adding support for non-string types in the ApproximateTopK aggregator by using pickle for serialization. This correctly preserves data types like integers and lists, which were previously cast to strings. The implementation is clean and the addition of the encode_lists flag provides useful flexibility for handling list elements. I've identified a critical issue in the new tests where assertions are incorrect due to what appears to be a copy-paste error. My review includes a code suggestion to fix this.

python/ray/data/tests/test_custom_agg.py

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

python/ray/data/aggregate.py

Signed-off-by: kyuds <kyuseung1016@gmail.com>

python/ray/data/tests/test_custom_agg.py

bveeramani · 2025-11-18T01:26:55Z

python/ray/data/aggregate.py

+
        return [
-            {self.get_target_column(): str(item[0]), "count": int(item[1])}
+            {column: pickle.loads(bytes.fromhex(str(item[0]))), "count": int(item[1])}


For my own understanding, what is the type of item[0]? Why do we need to convert it to a string first?

This was carried over from the original implementation. I didn't really put much thought into it, but after investigating, there is no need to convert `item[0]` to a string because it already is one. Fixed accordingly, though I will say that the datasketches library is quite poorly documented.

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

cursor · 2025-11-18T13:45:12Z

python/ray/data/aggregate.py

+
        return [
-            {self.get_target_column(): str(item[0]), "count": int(item[1])}
+            {column: pickle.loads(bytes.fromhex(item[0])), "count": item[1]}


Bug: Restore explicit integer casting for counts.

The count value is not explicitly cast to int. The previous implementation at line 1549 used int(item[1]), but the new code at line 1550 omits this cast. The datasketches library may return a non-Python int type (e.g., numpy int), and the explicit cast ensures type consistency with the original behavior and expectations.
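To illustrate the point about count types, here is a small sketch. `SketchCount` is a made-up stand-in for whatever integer-like type a sketch library might return, and `decode_results` is a hypothetical helper rather than the function in `aggregate.py`.

```python
import pickle


class SketchCount:
    """Made-up integer-like type standing in for a non-builtin count."""

    def __init__(self, value):
        self._value = value

    def __int__(self):
        return self._value


def decode_results(column, items):
    """Turn (hex_pickle, count) pairs into rows with plain Python types."""
    return [
        # int(...) normalizes the count regardless of the type the sketch returns.
        {column: pickle.loads(bytes.fromhex(hex_item)), "count": int(count)}
        for hex_item, count in items
    ]


raw = [(pickle.dumps(1).hex(), SketchCount(5))]
rows = decode_results("id", raw)
```

Keeping the explicit `int(...)` costs nothing when the count is already a Python `int` and keeps the output type stable if the underlying library returns something else.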

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

bveeramani

Thank you for the contribution!!

…flag (#58916)

## Description

Basically the same idea as #58659.

The `Unique` aggregator uses the `pyarrow.compute.unique` function internally, which doesn't work with non-hashable types like lists. Similar to what I did for `ApproximateTopK`, we now use pickle to serialize and deserialize elements.

Other improvements:
- The `ignore_nulls` flag didn't work at all. It now works properly.
- Had to force `ignore_nulls=False` for the datasets `unique` API for backwards compatibility (we set `ignore_nulls` to `True` by default, so the behavior of the datasets `unique` API would otherwise change now that `ignore_nulls` actually works).

## Related issues

This PR replaces #58538

## Additional information

[Design doc on my notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link)

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
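The idea behind the `Unique` change in the commit message above can be sketched in plain Python. This is an illustration under assumptions, not the merged code: `unique_values` is a hypothetical helper, a Python `set` of pickled bytes stands in for `pyarrow.compute.unique`, and `ignore_nulls` is assumed to mean that `None` values are dropped.

```python
import pickle


def unique_values(column, ignore_nulls=True):
    """Return distinct values in first-seen order (hypothetical helper)."""
    seen = set()
    result = []
    for value in column:
        if ignore_nulls and value is None:
            continue
        key = pickle.dumps(value)  # bytes are hashable even when value is not
        if key not in seen:
            seen.add(key)
            result.append(value)
    return result


column = [[1, 2], [1, 2], None, [3], None]
with_nulls = unique_values(column, ignore_nulls=False)
without_nulls = unique_values(column)
```

Putting the lists themselves into a `set` would raise `TypeError` because lists are unhashable, which is why the pickled bytes are used as the deduplication key.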

improve approximatetopk

5a05e8d

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

kyuds requested a review from a team as a code owner November 15, 2025 14:02

gemini-code-assist bot reviewed Nov 15, 2025

python/ray/data/tests/test_custom_agg.py Outdated

cursor bot reviewed Nov 15, 2025

python/ray/data/tests/test_custom_agg.py Outdated

fix copy-paste mistake

1ba78bd

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

ray-gardener bot added the `data` (Ray Data-related issues) and `community-contribution` (Contributed by the community) labels Nov 15, 2025

gvspraveen requested a review from goutamvenkat-anyscale November 17, 2025 05:27

bveeramani requested review from cem-anyscale and removed request for goutamvenkat-anyscale November 17, 2025 18:02

cem-anyscale reviewed Nov 18, 2025

python/ray/data/aggregate.py

cem-anyscale approved these changes Nov 18, 2025

add doc comments per review

b68e561

Signed-off-by: kyuds <kyuseung1016@gmail.com>

bveeramani reviewed Nov 18, 2025

reflect review comments

f1fc217

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

kyuds requested a review from bveeramani November 18, 2025 13:44

cursor bot reviewed Nov 18, 2025

cursor report

07c9eb8

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

bveeramani enabled auto-merge (squash) November 18, 2025 18:23

github-actions bot added the `go` (add ONLY when ready to merge, run all tests) label Nov 18, 2025

bveeramani approved these changes Nov 19, 2025

bveeramani merged commit 319caf3 into ray-project:master Nov 19, 2025
7 checks passed

kyuds deleted the improve-approx-topk-agg branch November 20, 2025 00:20

This was referenced Nov 22, 2025

[Data] Support List Types for Unique Aggregator and encode_lists flag #58916

Merged

[Data] Allow Unique and ApproximateTopK to encode list values #58538

Closed

bveeramani merged 5 commits into ray-project:master from kyuds:improve-approx-topk-agg
