[Data] Support Non-String Items for ApproximateTopK Aggregator#58659
[Data] Support Non-String Items for ApproximateTopK Aggregator#58659bveeramani merged 5 commits intoray-project:masterfrom
ApproximateTopK Aggregator#58659Conversation
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
|
PTAL @owenowenisme @bveeramani |
There was a problem hiding this comment.
Code Review
This pull request is a great improvement, adding support for non-string types in the ApproximateTopK aggregator by using pickle for serialization. This correctly preserves data types like integers and lists, which were previously cast to strings. The implementation is clean and the addition of the encode_lists flag provides useful flexibility for handling list elements. I've identified a critical issue in the new tests where assertions are incorrect due to what appears to be a copy-paste error. My review includes a code suggestion to fix this.
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
python/ray/data/aggregate.py
Outdated
|
|
||
| return [ | ||
| {self.get_target_column(): str(item[0]), "count": int(item[1])} | ||
| {column: pickle.loads(bytes.fromhex(str(item[0]))), "count": int(item[1])} |
There was a problem hiding this comment.
For my own understanding, what is the type of item[0]? Why do we need to convert it to a string first?
There was a problem hiding this comment.
this was actually there due to the original implementation. I didnt really put much thought into it, but after investigation, there is no need to convert to string as it already is. Fixed accordingly, though I will say that the datasketches library is quite poorly documented
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
python/ray/data/aggregate.py
Outdated
|
|
||
| return [ | ||
| {self.get_target_column(): str(item[0]), "count": int(item[1])} | ||
| {column: pickle.loads(bytes.fromhex(item[0])), "count": item[1]} |
There was a problem hiding this comment.
Bug: Restore explicit integer casting for counts.
The count value is not explicitly cast to int. The previous implementation at line 1549 used int(item[1]), but the new code at line 1550 omits this cast. The datasketches library may return a non-Python int type (e.g., numpy int), and the explicit cast ensures type consistency with the original behavior and expectations.
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
bveeramani
left a comment
There was a problem hiding this comment.
Thank you for the contribution!!
…-project#58659) ## Description Internally, the `ApproximateTopK` aggregator uses `frequent_strings_sketch` to implement efficient top-k calculations. As hinted in the name `frequent_strings_sketch`, the current implementation casts all data to string before inputting it into the sketch, so the output data is also in string. Therefore, when we have numeric data, for instance, we would get: ``` [{"id": "1", "count": 5} ... ] # notice 1 is not an integer, but string ``` instead of ``` [{"id": 1, "count": 5} ... ] ``` which would be expected. Other types, like lists, tuples, etc will also be cast to string, making it hard for users to recover data. This PR (with offline discussion with some Ray Data team members) attempts to use the `pickle` library to pickle and unpickle data so that when you input the data to `frequent_strings_sketch`, you insert the hex string of the pickle dump. As further improvements, this PR also supports `encode_lists` flag to encode individual list values. This will be useful for our encoders (specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future. ## Related issues N/A ## Additional information N/A --------- Signed-off-by: Daniel Shin <kyuseung1016@gmail.com> Signed-off-by: kyuds <kyuseung1016@gmail.com>
…-project#58659) ## Description Internally, the `ApproximateTopK` aggregator uses `frequent_strings_sketch` to implement efficient top-k calculations. As hinted in the name `frequent_strings_sketch`, the current implementation casts all data to string before inputting it into the sketch, so the output data is also in string. Therefore, when we have numeric data, for instance, we would get: ``` [{"id": "1", "count": 5} ... ] # notice 1 is not an integer, but string ``` instead of ``` [{"id": 1, "count": 5} ... ] ``` which would be expected. Other types, like lists, tuples, etc will also be cast to string, making it hard for users to recover data. This PR (with offline discussion with some Ray Data team members) attempts to use the `pickle` library to pickle and unpickle data so that when you input the data to `frequent_strings_sketch`, you insert the hex string of the pickle dump. As further improvements, this PR also supports `encode_lists` flag to encode individual list values. This will be useful for our encoders (specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future. ## Related issues N/A ## Additional information N/A --------- Signed-off-by: Daniel Shin <kyuseung1016@gmail.com> Signed-off-by: kyuds <kyuseung1016@gmail.com> Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
…-project#58659) ## Description Internally, the `ApproximateTopK` aggregator uses `frequent_strings_sketch` to implement efficient top-k calculations. As hinted in the name `frequent_strings_sketch`, the current implementation casts all data to string before inputting it into the sketch, so the output data is also in string. Therefore, when we have numeric data, for instance, we would get: ``` [{"id": "1", "count": 5} ... ] # notice 1 is not an integer, but string ``` instead of ``` [{"id": 1, "count": 5} ... ] ``` which would be expected. Other types, like lists, tuples, etc will also be cast to string, making it hard for users to recover data. This PR (with offline discussion with some Ray Data team members) attempts to use the `pickle` library to pickle and unpickle data so that when you input the data to `frequent_strings_sketch`, you insert the hex string of the pickle dump. As further improvements, this PR also supports `encode_lists` flag to encode individual list values. This will be useful for our encoders (specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future. ## Related issues N/A ## Additional information N/A --------- Signed-off-by: Daniel Shin <kyuseung1016@gmail.com> Signed-off-by: kyuds <kyuseung1016@gmail.com>
…flag (#58916) ## Description Basically the same idea as #58659 So `Unique` aggregator uses `pyarrow.compute.unique` function internally. This doesn't work with non-hashable types like lists. Similar to what I did for `ApproximateTopK`, we now use pickle to serialize and deserialize elements. Other improvements: - `ignore_nulls` flag didn't work at all. This flag now properly works - Had to force `ignore_nulls=False` for datasets `unique` api for backwards compatibility (we set `ignore_nulls` to `True` by default, so behavior for datasets `unique` api will change now that `ignore_nulls` actually works) ## Related issues This PR replaces #58538 ## Additional information [Design doc on my notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link) --------- Signed-off-by: Daniel Shin <kyuseung1016@gmail.com> Signed-off-by: kyuds <kyuseung1016@gmail.com>
…-project#58659) ## Description Internally, the `ApproximateTopK` aggregator uses `frequent_strings_sketch` to implement efficient top-k calculations. As hinted in the name `frequent_strings_sketch`, the current implementation casts all data to string before inputting it into the sketch, so the output data is also in string. Therefore, when we have numeric data, for instance, we would get: ``` [{"id": "1", "count": 5} ... ] # notice 1 is not an integer, but string ``` instead of ``` [{"id": 1, "count": 5} ... ] ``` which would be expected. Other types, like lists, tuples, etc will also be cast to string, making it hard for users to recover data. This PR (with offline discussion with some Ray Data team members) attempts to use the `pickle` library to pickle and unpickle data so that when you input the data to `frequent_strings_sketch`, you insert the hex string of the pickle dump. As further improvements, this PR also supports `encode_lists` flag to encode individual list values. This will be useful for our encoders (specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future. ## Related issues N/A ## Additional information N/A --------- Signed-off-by: Daniel Shin <kyuseung1016@gmail.com> Signed-off-by: kyuds <kyuseung1016@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…flag (ray-project#58916) ## Description Basically the same idea as ray-project#58659 So `Unique` aggregator uses `pyarrow.compute.unique` function internally. This doesn't work with non-hashable types like lists. Similar to what I did for `ApproximateTopK`, we now use pickle to serialize and deserialize elements. Other improvements: - `ignore_nulls` flag didn't work at all. This flag now properly works - Had to force `ignore_nulls=False` for datasets `unique` api for backwards compatibility (we set `ignore_nulls` to `True` by default, so behavior for datasets `unique` api will change now that `ignore_nulls` actually works) ## Related issues This PR replaces ray-project#58538 ## Additional information [Design doc on my notion](https://www.notion.so/kyuds/Unique-Aggregator-Improvements-2b67a80e48eb80de9820edf9d4996e0a?source=copy_link) --------- Signed-off-by: Daniel Shin <kyuseung1016@gmail.com> Signed-off-by: kyuds <kyuseung1016@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Internally, the
ApproximateTopKaggregator usesfrequent_strings_sketchto implement efficient top-k calculations. As hinted in the namefrequent_strings_sketch, the current implementation casts all data to string before inputting it into the sketch, so the output data is also in string.Therefore, when we have numeric data, for instance, we would get:
instead of
which would be expected.
Other types, like lists, tuples, etc will also be cast to string, making it hard for users to recover data.
This PR (with offline discussion with some Ray Data team members) attempts to use the
picklelibrary to pickle and unpickle data so that when you input the data tofrequent_strings_sketch, you insert the hex string of the pickle dump.As further improvements, this PR also supports
encode_listsflag to encode individual list values. This will be useful for our encoders (specificallyMultiHotEncoderandOrdinalEncoder) in the future.Related issues
N/A
Additional information
N/A