[Data] Use ApproximateTopK and Unique Aggregators for Encoders#58450
kyuds wants to merge 10 commits into ray-project:master
Conversation
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Code Review
This pull request is a great improvement. It refactors the encoder preprocessors to use the Unique and ApproximateTopK aggregators, which simplifies the codebase significantly and should improve performance. The logic of replacing the manual map_batches implementation with dedicated aggregators is sound. The changes in aggregate.py to support list encoding in Unique and ApproximateTopK are also well-implemented. Overall, this is a high-quality refactoring. I have one minor suggestion to correct a docstring.
def list_as_category(element):
-    key = tuple(element)
+    key = str(element)
Bug: Encoding type change breaks category sorting order
Changing from tuple(element) to str(element) for list encoding alters the sort order of categories. String representations sort lexicographically (e.g., "[1, 10, 2]" vs "[1, 2, 3]"), while tuples sort element-wise (e.g., (1, 10, 2) vs (1, 2, 3)). This produces different encoding indices in unique_post_fn, breaking compatibility with existing fitted encoders and potentially causing incorrect transformations.
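A quick illustration of the ordering difference (plain Python, independent of the PR's code):

```python
# String representations sort lexicographically, tuples element-wise,
# so the category order (and hence the assigned encoding index) can differ.
as_tuples = [(1, 10, 2), (1, 2, 3)]
as_strings = [str(list(t)) for t in as_tuples]  # "[1, 10, 2]", "[1, 2, 3]"

print(sorted(as_tuples))   # (1, 2, 3) first, because 2 < 10 element-wise
print(sorted(as_strings))  # "[1, 10, 2]" first, because "1" < "2" as characters
```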
pinging @richardliaw @alexeykudinkin - would this be relevant?
What's the motivation for using str rather than tuple here?
Because ApproximateTopK serializes everything to a string for datasketches (frequent_strings_sketch), we have to serialize (cast) to str to keep the data types consistent.
to_aggregate_unique = []
...
for col in columns:
Bug: Inconsistent Encoding Between Aggregators with Max Categories
The ApproximateTopK aggregator converts all values to strings via str(py_value), while the Unique aggregator preserves original types for scalar values. This causes inconsistent encoding mappings when max_categories is specified versus when it's not. For numeric columns, this leads to lexicographic sorting (e.g., ['1', '10', '2']) instead of numeric sorting (e.g., [1, 2, 10]), producing different integer encodings for the same categories depending on whether max_categories is used.
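For example, with hypothetical numeric category values (plain Python, not the actual aggregator code):

```python
# Unique keeps original types; ApproximateTopK str-casts everything.
# Sorting the resulting category sets yields different index mappings.
raw_values = [1, 10, 2]

unique_categories = sorted(raw_values)                    # numeric sort: [1, 2, 10]
topk_categories = sorted(str(v) for v in raw_values)      # lexicographic: ['1', '10', '2']

unique_index = {v: i for i, v in enumerate(unique_categories)}
topk_index = {v: i for i, v in enumerate(topk_categories)}

print(unique_index)  # category 2 -> index 1
print(topk_index)    # category '2' -> index 2
```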
test_inputs = {"category": ["1", [1]]}
test_pd_df = pd.DataFrame(test_inputs)
test_data_for_fitting = {"category": ["1", "[1]", "a", "[]", "True"]}
Had to change this because now, instead of serializing lists to tuples, we serialize by calling the str() function (the pyarrow unique function and the datasketches library used right now will not accept tuples).
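A small illustration of the collision risk str() serialization introduces, matching the test data above (plain Python):

```python
# The string "[1]" and the list [1] serialize to the same key under str(),
# so they can no longer be distinguished as separate categories.
values = ["1", [1], "[1]"]
keys = [str(v) for v in values]

print(keys)  # ['1', '[1]', '[1]'] -- the last two collide
```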
self.k = k
self._log_capacity = log_capacity
self._frequent_strings_sketch = self._require_datasketches()
self.encode_lists = encode_lists
Nit: For consistency with the other attributes?
self.encode_lists = encode_lists
I'm assuming you mean a prepended underscore?
log_capacity: Base 2 logarithm of the maximum size of the internal hash map for
    top-K calculation. Higher values increase accuracy but use more memory.
    Defaults to 11 (2048 categories).
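For reference, the docstring's numbers line up as follows (a trivial sketch; `sketch_capacity` is a hypothetical helper, not part of the PR, assuming capacity is 2 ** log_capacity as documented):

```python
def sketch_capacity(log_capacity: int) -> int:
    # Hypothetical: internal hash map size implied by the docstring.
    return 2 ** log_capacity

print(sketch_capacity(11))  # 2048, the documented default
print(sketch_capacity(12))  # 4096: each extra bit doubles the map size
```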
Do we need to expose this as a top-level parameter for now? This parameter couples the interface with a specific implementation, and it might make it harder to change the implementation later
py_list = col.to_pylist()
str_list = [None if v is None else str(v) for v in py_list]
col = pyarrow.array(str_list, type=pyarrow.string())
return pc.unique(col).to_pylist()
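A pure-Python sketch of the same serialize-then-unique idea (avoiding the pyarrow dependency for illustration; `unique_via_str` is a hypothetical mirror of the snippet, not the actual code):

```python
def unique_via_str(values):
    # Cast every non-None value to str (None is preserved, as in the snippet),
    # then collect uniques in first-seen order.
    seen = set()
    out = []
    for v in values:
        s = None if v is None else str(v)
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

print(unique_via_str([[1, 2], [1, 2], None, [3]]))  # ['[1, 2]', None, '[3]']
```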
What was the previous behavior if you have a list column with Unique?
It crashes (the pyarrow compute unique function can't be used on list columns).
Discussed directly with @kyuds -- the plan is to split this PR up into a few smaller PRs so they're easier to review.
Superseded by follow-up PRs.
Description
Currently, Ray Data encoder preprocessors count the number of occurrences per distinct item, regardless of whether we actually need the top-k information. This is inefficient for a couple of reasons: even when max_categories is set (OneHot, MultiHot), there are still columns not included in max_categories, and even for the columns that are, calculating the exact top-k is inefficient.

Therefore, this PR changes the encoders to use the Unique aggregator when possible, and to use the ApproximateTopK aggregator when max_categories is set for OneHot and MultiHot encoders.

Considerations:
Because we are now using aggregations, we expect lists to have homogeneous datatypes when using OneHotEncoder; mixed-type lists in a column are not possible.
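The aggregator selection this PR introduces can be sketched as follows (`pick_aggregator` is a hypothetical helper for illustration, not the actual Ray Data code):

```python
def pick_aggregator(col, max_categories=None):
    # Use ApproximateTopK only when a per-column cap is configured;
    # fall back to the cheaper Unique aggregator otherwise.
    if max_categories and col in max_categories:
        return ("ApproximateTopK", max_categories[col])
    return ("Unique", None)

print(pick_aggregator("color", {"color": 5}))  # ('ApproximateTopK', 5)
print(pick_aggregator("size", {"color": 5}))   # ('Unique', None)
```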
Note that such semantics were already not possible for preprocessors that encode lists, like MultiHotEncoder and OrdinalEncoder (with encode_lists=True).

Related issues
N/A
Additional information
N/A