[Data] Use ApproximateTopK and Unique Aggregators for Encoders#58450
kyuds wants to merge 10 commits into ray-project:master
Conversation
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Code Review
This pull request is a great improvement. It refactors the encoder preprocessors to use the Unique and ApproximateTopK aggregators, which simplifies the codebase significantly and should improve performance. The logic of replacing the manual map_batches implementation with dedicated aggregators is sound. The changes in aggregate.py to support list encoding in Unique and ApproximateTopK are also well-implemented. Overall, this is a high-quality refactoring. I have one minor suggestion to correct a docstring.
def list_as_category(element):
-    key = tuple(element)
+    key = str(element)
Bug: Encoding type change breaks category sorting order
Changing from tuple(element) to str(element) for list encoding alters the sort order of categories. String representations sort lexicographically (e.g., "[1, 10, 2]" vs "[1, 2, 3]"), while tuples sort element-wise (e.g., (1, 10, 2) vs (1, 2, 3)). This produces different encoding indices in unique_post_fn, breaking compatibility with existing fitted encoders and potentially causing incorrect transformations.
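A quick illustration of the ordering difference (plain Python, independent of the PR's code):

```python
# String representations sort lexicographically, tuples element-wise,
# so the category order (and hence the assigned encoding index) can differ.
as_tuples = [(1, 10, 2), (1, 2, 3)]
as_strings = [str(list(t)) for t in as_tuples]  # "[1, 10, 2]", "[1, 2, 3]"

print(sorted(as_tuples))   # (1, 2, 3) first, because 2 < 10 element-wise
print(sorted(as_strings))  # "[1, 10, 2]" first, because "1" < "2" as characters
```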
pinging @richardliaw @alexeykudinkin - would this be relevant?
What's the motivation for using str rather than tuple here?
Because ApproximateTopK serializes everything to a string for datasketches (frequent_strings_sketch), we have to serialize (cast) to str to keep the data types consistent.
to_aggregate_unique = []
...
for col in columns:
Bug: Inconsistent Encoding Between Aggregators with Max Categories
The ApproximateTopK aggregator converts all values to strings via str(py_value), while the Unique aggregator preserves original types for scalar values. This causes inconsistent encoding mappings when max_categories is specified versus when it's not. For numeric columns, this leads to lexicographic sorting (e.g., ['1', '10', '2']) instead of numeric sorting (e.g., [1, 2, 10]), producing different integer encodings for the same categories depending on whether max_categories is used.
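For example, with hypothetical numeric category values (plain Python, not the actual aggregator code):

```python
# Unique keeps original types; ApproximateTopK str-casts everything.
# Sorting the resulting category sets yields different index mappings.
raw_values = [1, 10, 2]

unique_categories = sorted(raw_values)                    # numeric sort: [1, 2, 10]
topk_categories = sorted(str(v) for v in raw_values)      # lexicographic: ['1', '10', '2']

unique_index = {v: i for i, v in enumerate(unique_categories)}
topk_index = {v: i for i, v in enumerate(topk_categories)}

print(unique_index)  # category 2 -> index 1
print(topk_index)    # category '2' -> index 2
```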
test_inputs = {"category": ["1", [1]]}
test_pd_df = pd.DataFrame(test_inputs)
test_data_for_fitting = {"category": ["1", "[1]", "a", "[]", "True"]}
Had to change this because now, instead of serializing lists to tuples, we serialize by calling the str() function (the pyarrow unique function and the datasketches library used right now will not accept tuples).
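A small illustration of the collision risk str() serialization introduces, matching the test data above (plain Python):

```python
# The string "[1]" and the list [1] serialize to the same key under str(),
# so they can no longer be distinguished as separate categories.
values = ["1", [1], "[1]"]
keys = [str(v) for v in values]

print(keys)  # ['1', '[1]', '[1]'] -- the last two collide
```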
self.k = k
self._log_capacity = log_capacity
self._frequent_strings_sketch = self._require_datasketches()
self.encode_lists = encode_lists
Nit: For consistency with the other attributes?
self.encode_lists = encode_lists
I'm assuming you mean a prepended underscore?
log_capacity: Base 2 logarithm of the maximum size of the internal hash map for
    top-K calculation. Higher values increase accuracy but use more memory.
    Defaults to 11 (2048 categories).
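For reference, the docstring's numbers line up as follows (a trivial sketch; `sketch_capacity` is a hypothetical helper, not part of the PR, assuming capacity is 2 ** log_capacity as documented):

```python
def sketch_capacity(log_capacity: int) -> int:
    # Hypothetical: internal hash map size implied by the docstring.
    return 2 ** log_capacity

print(sketch_capacity(11))  # 2048, the documented default
print(sketch_capacity(12))  # 4096: each extra bit doubles the map size
```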
Do we need to expose this as a top-level parameter for now? This parameter couples the interface with a specific implementation, and it might make it harder to change the implementation later
py_list = col.to_pylist()
str_list = [None if v is None else str(v) for v in py_list]
col = pyarrow.array(str_list, type=pyarrow.string())
return pc.unique(col).to_pylist()
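A pure-Python sketch of the same serialize-then-unique idea (avoiding the pyarrow dependency for illustration; `unique_via_str` is a hypothetical mirror of the snippet, not the actual code):

```python
def unique_via_str(values):
    # Cast every non-None value to str (None is preserved, as in the snippet),
    # then collect uniques in first-seen order.
    seen = set()
    out = []
    for v in values:
        s = None if v is None else str(v)
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

print(unique_via_str([[1, 2], [1, 2], None, [3]]))  # ['[1, 2]', None, '[3]']
```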
What was the previous behavior if you have a list column with Unique?
It crashes (the pyarrow compute unique function can't be used on list columns).
Discussed directly with @kyuds -- the plan is to split this PR up into a few smaller PRs so they're easier to review.
Superseded by follow-up PRs.
Description
Currently, Ray Data encoder preprocessors count the number of occurrences per distinct item, regardless of whether we actually need the top-k information. This is inefficient for a couple of reasons: even when max_categories is set (OneHot, MultiHot), there are still columns not included in max_categories, and even for the columns that are, calculating the exact top-k is inefficient.

Therefore, this PR changes the encoders to use the Unique aggregator when possible, and to use the ApproximateTopK aggregator when max_categories is set for OneHot and MultiHot encoders.

Considerations:
Because we are now using aggregations, we expect lists to have homogeneous datatypes when using OneHotEncoder; mixed-type lists in a column are not possible.
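The aggregator selection this PR introduces can be sketched as follows (`pick_aggregator` is a hypothetical helper for illustration, not the actual Ray Data code):

```python
def pick_aggregator(col, max_categories=None):
    # Use ApproximateTopK only when a per-column cap is configured;
    # fall back to the cheaper Unique aggregator otherwise.
    if max_categories and col in max_categories:
        return ("ApproximateTopK", max_categories[col])
    return ("Unique", None)

print(pick_aggregator("color", {"color": 5}))  # ('ApproximateTopK', 5)
print(pick_aggregator("size", {"color": 5}))   # ('Unique', None)
```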
Note that such semantics were already not possible for preprocessors that encode lists, like MultiHotEncoder and OrdinalEncoder (with encode_lists=True).

Related issues
N/A
Additional information
N/A