Skip to content

[Data] Use ApproximateTopK and Unique Aggregators for Encoders#58450

Closed
kyuds wants to merge 10 commits intoray-project:masterfrom
kyuds:approx-topk
Closed

[Data] Use ApproximateTopK and Unique Aggregators for Encoders#58450
kyuds wants to merge 10 commits intoray-project:masterfrom
kyuds:approx-topk

Conversation

@kyuds
Copy link
Copy Markdown
Member

@kyuds kyuds commented Nov 7, 2025

Description

Currently, Ray Data encoder preprocessors will count the number of occurrences per distinct item, regardless of whether we actually need the top-k information. This is inefficient for couple reasons:

  1. For most encoders, we don't need this count information. We just need the unique values.
  2. For encoders that use max_categories (OneHot, MultiHot), there are still columns not included in max_categories, and even if they are included, calculating exact top-k is inefficient.

Therefore, this PR changes the encoders to use Unique aggregator when possible, and to use ApproximateTopK aggregator when max_categories is set for OneHot and MultiHot encoders.

Considerations:
Because now we are using aggregations, we do expect lists to have homogenous datatypes when using OneHotEncoder. Basically:

df = pd.DataFrame({
    "name": ["Shaolin Soccer", "Moana", "The Smartest Guys in the Room"],
    "genre": [
        ["comedy", "action", "sports"],
        ["animation", 2,  "action"], # <- notice the integer here
        ["documentary"],
    ],
})
ds = ray.data.from_pandas(df)  
encoder = OneHotEncoder(columns=["genre"])
pdf = encoder.fit_transform(ds).to_pandas() 
print(pdf)

This is not possible.

Note that such semantics were already not possible for preprocessors that encode lists, like MultiHotEncoder, OrdinalEncoder (with encode_lists=True)

Related issues

N/A

Additional information

N/A

kyuds added 6 commits November 7, 2025 17:51
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@kyuds kyuds requested a review from a team as a code owner November 7, 2025 12:00
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a great improvement. It refactors the encoder preprocessors to use the Unique and ApproximateTopK aggregators, which simplifies the codebase significantly and should improve performance. The logic of replacing the manual map_batches implementation with dedicated aggregators is sound. The changes in aggregate.py to support list encoding in Unique and ApproximateTopK are also well-implemented. Overall, this is a high-quality refactoring. I have one minor suggestion to correct a docstring.


def list_as_category(element):
key = tuple(element)
key = str(element)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Encoding type change breaks category sorting order

Changing from tuple(element) to str(element) for list encoding alters the sort order of categories. String representations sort lexicographically (e.g., "[1, 10, 2]" vs "[1, 2, 3]"), while tuples sort element-wise (e.g., (1, 10, 2) vs (1, 2, 3)). This produces different encoding indices in unique_post_fn, breaking compatibility with existing fitted encoders and potentially causing incorrect transformations.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pinging @richardliaw @alexeykudinkin - would this be relevant?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for using str rather than tuple here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because ApproximateTopK serializes everything to string for datasketches (frequent_strings_sketch). Therefore, to make data types consistent, have to serialize (cast) to str

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Nov 7, 2025
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>

to_aggregate_unique = []

for col in columns:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Inconsistent Encoding Between Aggregators with Max Categories

The ApproximateTopK aggregator converts all values to strings via str(py_value), while the Unique aggregator preserves original types for scalar values. This causes inconsistent encoding mappings when max_categories is specified versus when it's not. For numeric columns, this leads to lexicographic sorting (e.g., ['1', '10', '2']) instead of numeric sorting (e.g., [1, 2, 10]), producing different integer encodings for the same categories depending on whether max_categories is used.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this matter?


test_inputs = {"category": ["1", [1]]}
test_pd_df = pd.DataFrame(test_inputs)
test_data_for_fitting = {"category": ["1", "[1]", "a", "[]", "True"]}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had to change this because now, instead of serializing lists to tuple, we serialize via calling str() function (this is because pyarrow.unique and datasketches used right now will not accept tuples.

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Nov 11, 2025
self.k = k
self._log_capacity = log_capacity
self._frequent_strings_sketch = self._require_datasketches()
self.encode_lists = encode_lists
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: For consistency with the other attributes?

Suggested change
self.encode_lists = encode_lists
self.encode_lists = encode_lists

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im assuming you mean prepended underscore?

Comment on lines +243 to +245
log_capacity: Base 2 logarithm of the maximum size of the internal hash map for
top-K calculation. Higher values increase accuracy but use more memory.
Defaults to 11 (2048 categories).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to expose this as a top-level parameter for now? This parameter couples the interface with a specific implementation, and it might make it harder to change the implementation later

py_list = col.to_pylist()
str_list = [None if v is None else str(v) for v in py_list]
col = pyarrow.array(str_list, type=pyarrow.string())
return pc.unique(col).to_pylist()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the previous behavior if you have a list column with Unique?

Copy link
Copy Markdown
Member Author

@kyuds kyuds Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it crashes (can't use pyarrow.compute unique function on lists)


def list_as_category(element):
key = tuple(element)
key = str(element)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for using str rather than tuple here?

@bveeramani
Copy link
Copy Markdown
Member

Discussed directly with @kyuds -- the plan is to split this PR up into a few smaller PRs so they're easier to review.

@kyuds
Copy link
Copy Markdown
Member Author

kyuds commented Dec 11, 2025

superceded by follow up prs

@kyuds kyuds closed this Dec 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants