Skip to content

[Data] Support Non-String Items for ApproximateTopK Aggregator#58659

Merged
bveeramani merged 5 commits intoray-project:masterfrom
kyuds:improve-approx-topk-agg
Nov 19, 2025
Merged

[Data] Support Non-String Items for ApproximateTopK Aggregator#58659
bveeramani merged 5 commits intoray-project:masterfrom
kyuds:improve-approx-topk-agg

Conversation

@kyuds
Copy link
Copy Markdown
Member

@kyuds kyuds commented Nov 15, 2025

Description

Internally, the ApproximateTopK aggregator uses frequent_strings_sketch to implement efficient top-k calculations. As hinted in the name frequent_strings_sketch, the current implementation casts all data to string before inputting it into the sketch, so the output data is also in string.

Therefore, when we have numeric data, for instance, we would get:

[{"id": "1", "count": 5} ... ]  # notice 1 is not an integer, but string

instead of

[{"id": 1, "count": 5} ... ]

which would be expected.

Other types, like lists, tuples, etc will also be cast to string, making it hard for users to recover data.

This PR (with offline discussion with some Ray Data team members) attempts to use the pickle library to pickle and unpickle data so that when you input the data to frequent_strings_sketch, you insert the hex string of the pickle dump.

As further improvements, this PR also supports encode_lists flag to encode individual list values. This will be useful for our encoders (specifically MultiHotEncoder and OrdinalEncoder) in the future.

Related issues

N/A

Additional information

N/A

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@kyuds kyuds requested a review from a team as a code owner November 15, 2025 14:02
@kyuds
Copy link
Copy Markdown
Member Author

kyuds commented Nov 15, 2025

PTAL @owenowenisme @bveeramani

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a great improvement, adding support for non-string types in the ApproximateTopK aggregator by using pickle for serialization. This correctly preserves data types like integers and lists, which were previously cast to strings. The implementation is clean and the addition of the encode_lists flag provides useful flexibility for handling list elements. I've identified a critical issue in the new tests where assertions are incorrect due to what appears to be a copy-paste error. My review includes a code suggestion to fix this.

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 15, 2025
@bveeramani bveeramani requested review from cem-anyscale and removed request for goutamvenkat-anyscale November 17, 2025 18:02
Signed-off-by: kyuds <kyuseung1016@gmail.com>

return [
{self.get_target_column(): str(item[0]), "count": int(item[1])}
{column: pickle.loads(bytes.fromhex(str(item[0]))), "count": int(item[1])}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own understanding, what is the type of item[0]? Why do we need to convert it to a string first?

Copy link
Copy Markdown
Member Author

@kyuds kyuds Nov 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was actually there due to the original implementation. I didnt really put much thought into it, but after investigation, there is no need to convert to string as it already is. Fixed accordingly, though I will say that the datasketches library is quite poorly documented

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@kyuds kyuds requested a review from bveeramani November 18, 2025 13:44

return [
{self.get_target_column(): str(item[0]), "count": int(item[1])}
{column: pickle.loads(bytes.fromhex(item[0])), "count": item[1]}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Restore explicit integer casting for counts.

The count value is not explicitly cast to int. The previous implementation at line 1549 used int(item[1]), but the new code at line 1550 omits this cast. The datasketches library may return a non-Python int type (e.g., numpy int), and the explicit cast ensures type consistency with the original behavior and expectations.

Fix in Cursor Fix in Web

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@bveeramani bveeramani enabled auto-merge (squash) November 18, 2025 18:23
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 18, 2025
Copy link
Copy Markdown
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution!!

@bveeramani bveeramani merged commit 319caf3 into ray-project:master Nov 19, 2025
7 checks passed
@kyuds kyuds deleted the improve-approx-topk-agg branch November 20, 2025 00:20
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
…-project#58659)

## Description
Internally, the `ApproximateTopK` aggregator uses
`frequent_strings_sketch` to implement efficient top-k calculations. As
hinted in the name `frequent_strings_sketch`, the current implementation
casts all data to string before inputting it into the sketch, so the
output data is also in string.

Therefore, when we have numeric data, for instance, we would get:
```
[{"id": "1", "count": 5} ... ]  # notice 1 is not an integer, but string
```
instead of 
```
[{"id": 1, "count": 5} ... ]
```
which would be expected.

Other types, like lists, tuples, etc will also be cast to string, making
it hard for users to recover data.

This PR (with offline discussion with some Ray Data team members)
attempts to use the `pickle` library to pickle and unpickle data so that
when you input the data to `frequent_strings_sketch`, you insert the hex
string of the pickle dump.

As further improvements, this PR also supports `encode_lists` flag to
encode individual list values. This will be useful for our encoders
(specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future.

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…-project#58659)

## Description
Internally, the `ApproximateTopK` aggregator uses
`frequent_strings_sketch` to implement efficient top-k calculations. As
hinted in the name `frequent_strings_sketch`, the current implementation
casts all data to string before inputting it into the sketch, so the
output data is also in string.

Therefore, when we have numeric data, for instance, we would get:
```
[{"id": "1", "count": 5} ... ]  # notice 1 is not an integer, but string
```
instead of
```
[{"id": 1, "count": 5} ... ]
```
which would be expected.

Other types, like lists, tuples, etc will also be cast to string, making
it hard for users to recover data.

This PR (with offline discussion with some Ray Data team members)
attempts to use the `pickle` library to pickle and unpickle data so that
when you input the data to `frequent_strings_sketch`, you insert the hex
string of the pickle dump.

As further improvements, this PR also supports `encode_lists` flag to
encode individual list values. This will be useful for our encoders
(specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future.

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…-project#58659)

## Description
Internally, the `ApproximateTopK` aggregator uses
`frequent_strings_sketch` to implement efficient top-k calculations. As
hinted in the name `frequent_strings_sketch`, the current implementation
casts all data to string before inputting it into the sketch, so the
output data is also in string.

Therefore, when we have numeric data, for instance, we would get:
```
[{"id": "1", "count": 5} ... ]  # notice 1 is not an integer, but string
```
instead of 
```
[{"id": 1, "count": 5} ... ]
```
which would be expected.

Other types, like lists, tuples, etc will also be cast to string, making
it hard for users to recover data.

This PR (with offline discussion with some Ray Data team members)
attempts to use the `pickle` library to pickle and unpickle data so that
when you input the data to `frequent_strings_sketch`, you insert the hex
string of the pickle dump.

As further improvements, this PR also supports `encode_lists` flag to
encode individual list values. This will be useful for our encoders
(specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future.

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
richardliaw pushed a commit that referenced this pull request Dec 9, 2025
…flag (#58916)

## Description
Basically the same idea as #58659 

So `Unique` aggregator uses `pyarrow.compute.unique` function
internally. This doesn't work with non-hashable types like lists.
Similar to what I did for `ApproximateTopK`, we now use pickle to
serialize and deserialize elements.

Other improvements:
- `ignore_nulls` flag didn't work at all. This flag now properly works
- Had to force `ignore_nulls=False` for datasets `unique` api for
backwards compatibility (we set `ignore_nulls` to `True` by default, so
behavior for datasets `unique` api will change now that `ignore_nulls`
actually works)

## Related issues
This PR replaces #58538

## Additional information
[Design doc on my
notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link)

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…-project#58659)

## Description
Internally, the `ApproximateTopK` aggregator uses
`frequent_strings_sketch` to implement efficient top-k calculations. As
hinted in the name `frequent_strings_sketch`, the current implementation
casts all data to string before inputting it into the sketch, so the
output data is also in string.

Therefore, when we have numeric data, for instance, we would get:
```
[{"id": "1", "count": 5} ... ]  # notice 1 is not an integer, but string
```
instead of
```
[{"id": 1, "count": 5} ... ]
```
which would be expected.

Other types, like lists, tuples, etc will also be cast to string, making
it hard for users to recover data.

This PR (with offline discussion with some Ray Data team members)
attempts to use the `pickle` library to pickle and unpickle data so that
when you input the data to `frequent_strings_sketch`, you insert the hex
string of the pickle dump.

As further improvements, this PR also supports `encode_lists` flag to
encode individual list values. This will be useful for our encoders
(specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future.

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…flag (ray-project#58916)

## Description
Basically the same idea as ray-project#58659

So `Unique` aggregator uses `pyarrow.compute.unique` function
internally. This doesn't work with non-hashable types like lists.
Similar to what I did for `ApproximateTopK`, we now use pickle to
serialize and deserialize elements.

Other improvements:
- `ignore_nulls` flag didn't work at all. This flag now properly works
- Had to force `ignore_nulls=False` for datasets `unique` api for
backwards compatibility (we set `ignore_nulls` to `True` by default, so
behavior for datasets `unique` api will change now that `ignore_nulls`
actually works)

## Related issues
This PR replaces ray-project#58538

## Additional information
[Design doc on my
notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link)

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants