[Data][2/N] Arrow format for OneHotEncoder #59890
alexeykudinkin merged 38 commits into ray-project:master
The GIL makes checking `self._serialize_cache is not None` atomic, so we don't need a lock. Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Code Review
This pull request introduces a performance optimization for the OrdinalEncoder by implementing an adaptive strategy for transformations on Arrow tables. For large batches, it uses the vectorized pyarrow.compute.index_in, and for smaller batches, it falls back to a faster Python dictionary lookup to avoid PyArrow's hash table rebuild overhead. The logic has been cleanly refactored into several helper methods with caching for both Arrow arrays and Python lookup dictionaries. My feedback includes a suggestion to improve the caching implementation for better maintainability.
Updated the threshold for switching between Python dict and PyArrow pc.index_in from 10000 to 50. Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
    _validate_arrow(table, *self.columns)

    # Check for list columns (runtime fallback for PandasBlockSchema datasets)
    for col_name in self.columns:
(Follow-up) Please make private fields properly private (prefix with _)
I will track this as a follow-up: https://anyscale1.atlassian.net/browse/DATA-1775
    def _init_arrow_cache(self):
        """Initialize the Arrow array cache. Call this in __init__."""
        self._cache: Dict[str, Tuple[pa.Array, pa.Array]] = {}

    def _clear_arrow_cache(self):
        """Clear cached Arrow arrays to ensure fresh data after re-fitting."""
        self._cache.clear()
Why do we need these 2?
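To handle re-fitting, e.g. `preprocessor.fit(A).fit(B)`: arrays cached after fitting on A must be cleared so that they are rebuilt from B's stats.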
Actually I found a bug and made a quick fix: #60031
I created a ticket to follow up: https://anyscale1.atlassian.net/browse/DATA-1788
Once we fix this PR, we can remove this cache.
        """Clear cached Arrow arrays to ensure fresh data after re-fitting."""
        self._cache.clear()

    def _get_arrow_arrays(self, input_col: str) -> Tuple[pa.Array, pa.Array]:
@xinyuangui2 my concern with the caching implementation is the following:
- We're adding complexity (which is fine by itself but needs to be clearly justified by performance win)
- We're adding additional state that implementers of the Preprocessors now need to manage
Oh, on second thought, why don't we just use `functools.cache` on this method to remove this state from the Preprocessor itself?
- `functools.cache` doesn't have the cache invalidation control that we want to include in `fit`.
- The cache doesn't depend only on `input_col`, but also implicitly on `self.stats_`.
I updated the benchmark with TPC-H. I think the performance win is clear.
Oh, you're right, we can't use functools.cache unfortunately
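A small sketch of the second point raised above, namely that the cached result also depends on `self.stats_`, which is not part of the cache key. The class `StaleCacheDemo` and its methods are hypothetical and exist only to show the effect; this is not the PR's code.

```python
import functools
from typing import Dict, Iterable, Tuple


class StaleCacheDemo:
    """Hypothetical class showing a stale functools.cache entry."""

    def __init__(self) -> None:
        self.stats_: Dict[str, Tuple[str, ...]] = {}

    def fit(self, column: str, values: Iterable[str]) -> "StaleCacheDemo":
        # Replaces the stats, but the decorator below has no way to know.
        self.stats_ = {column: tuple(sorted(set(values)))}
        return self

    @functools.cache
    def get_categories(self, input_col: str) -> Tuple[str, ...]:
        # The cache key is (self, input_col); self.stats_ is not part of it.
        return self.stats_[input_col]


demo = StaleCacheDemo().fit("c", ["a", "b"])
before_refit = demo.get_categories("c")
demo.fit("c", ["x"])
after_refit = demo.get_categories("c")  # served from the cache, now stale
```

After the second `fit`, `after_refit` still holds the categories from the first fit, which is the kind of silent mismatch an explicit per-instance cache cleared in `fit` is meant to avoid.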
Support arrow format in OneHotEncoder.

Benchmark: TPC-H SF10

The improvement for `OneHotEncoder`:

Before:

| Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|:----------:|----------------------:|:----------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 1 | 374 | 2.67 | 2.66 | 2.86 | 2.93 | 2.51 | 3.33 |
| 5 | 1,867 | 2.68 | 2.67 | 2.82 | 2.85 | 2.52 | 3.48 |
| 10 | 3,590 | 2.79 | 2.77 | 2.91 | 3.33 | 2.61 | 3.47 |
| 20 | 6,911 | 2.89 | 2.88 | 3.08 | 3.41 | 2.72 | 3.62 |
| 50 | 15,841 | 3.16 | 3.14 | 3.28 | 3.58 | 3.00 | 3.76 |
| 100 | 27,019 | 3.70 | 3.60 | 3.86 | 4.66 | 3.40 | 9.73 |

After:

| Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|:----------:|----------------------:|:----------------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 1 | 1,187 | 0.84 | 0.82 | 0.92 | 1.21 | 0.80 | 1.32 |
| 5 | 5,888 | 0.85 | 0.82 | 0.99 | 1.13 | 0.80 | 1.17 |
| 10 | 12,144 | 0.82 | 0.81 | 0.86 | 0.94 | 0.79 | 1.19 |
| 20 | 24,385 | 0.82 | 0.81 | 0.86 | 0.88 | 0.79 | 1.17 |
| 50 | 59,987 | 0.83 | 0.82 | 0.87 | 0.91 | 0.81 | 1.18 |
| 100 | 118,853 | 0.84 | 0.83 | 0.88 | 0.97 | 0.81 | 1.18 |

### Null behavior

We keep the null behaviors the same as the old pandas implementations.

| Encoder | Path | Null Input Behavior | Unseen Category Behavior |
|---------|------|---------------------|--------------------------|
| OrdinalEncoder | Pandas | **ValueError** | NaN |
| OrdinalEncoder | Arrow | **ValueError** | null (NaN when converted) |
| OneHotEncoder | Pandas | **ValueError** | all-zeros vector |
| OneHotEncoder | Arrow | **ValueError** | all-zeros vector |

---------

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
## Description

Rename fields in preprocessors to conform to the naming convention for private fields in classes.

## Related issues

Fixes the variable naming issue reported in #59890.

## Additional information

Backwards compatibility is kept via `property`, so the old variable names still provide access to the private variables.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
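A minimal sketch of the `property`-based backwards compatibility described in this commit, assuming a hypothetical class `EncoderSketch` with a single renamed field. The setter is an extra assumption; the commit message only states that the old names still provide access.

```python
from typing import List


class EncoderSketch:
    """Hypothetical class; shows the rename pattern, not Ray's code."""

    def __init__(self, columns: List[str]) -> None:
        # The field is now private by convention (leading underscore).
        self._columns = list(columns)

    @property
    def columns(self) -> List[str]:
        # The old public name keeps working for existing callers.
        return self._columns

    @columns.setter
    def columns(self, value: List[str]) -> None:
        # Assumption: writes through the old name are forwarded as well.
        self._columns = list(value)
```

Existing code that reads `encoder.columns` continues to work, while new code inside the class can use `self._columns` directly.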