
[Data][2/N] Arrow format for OneHotEncoder#59890

Merged
alexeykudinkin merged 38 commits into ray-project:master from xinyuangui2:preprocessor-batch-optimization
Jan 14, 2026

Conversation

@xinyuangui2 (Contributor) commented Jan 6, 2026

Support arrow format in OneHotEncoder.

Benchmark: TPC-H SF10

The improvement for OneHotEncoder:

Before:

| Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|:---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 374 | 2.67 | 2.66 | 2.86 | 2.93 | 2.51 | 3.33 |
| 5 | 1,867 | 2.68 | 2.67 | 2.82 | 2.85 | 2.52 | 3.48 |
| 10 | 3,590 | 2.79 | 2.77 | 2.91 | 3.33 | 2.61 | 3.47 |
| 20 | 6,911 | 2.89 | 2.88 | 3.08 | 3.41 | 2.72 | 3.62 |
| 50 | 15,841 | 3.16 | 3.14 | 3.28 | 3.58 | 3.00 | 3.76 |
| 100 | 27,019 | 3.70 | 3.60 | 3.86 | 4.66 | 3.40 | 9.73 |

After:

| Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|:---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1,187 | 0.84 | 0.82 | 0.92 | 1.21 | 0.80 | 1.32 |
| 5 | 5,888 | 0.85 | 0.82 | 0.99 | 1.13 | 0.80 | 1.17 |
| 10 | 12,144 | 0.82 | 0.81 | 0.86 | 0.94 | 0.79 | 1.19 |
| 20 | 24,385 | 0.82 | 0.81 | 0.86 | 0.88 | 0.79 | 1.17 |
| 50 | 59,987 | 0.83 | 0.82 | 0.87 | 0.91 | 0.81 | 1.18 |
| 100 | 118,853 | 0.84 | 0.83 | 0.88 | 0.97 | 0.81 | 1.18 |

Null behavior

We keep the null behavior the same as the old pandas implementation.

| Encoder | Path | Null Input Behavior | Unseen Category Behavior |
|---|---|---|---|
| OrdinalEncoder | Pandas | **ValueError** | NaN |
| OrdinalEncoder | Arrow | **ValueError** | null (NaN when converted) |
| OneHotEncoder | Pandas | **ValueError** | all-zeros vector |
| OneHotEncoder | Arrow | **ValueError** | all-zeros vector |


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a performance optimization for the OrdinalEncoder by implementing an adaptive strategy for transformations on Arrow tables. For large batches, it uses the vectorized pyarrow.compute.index_in, and for smaller batches, it falls back to a faster Python dictionary lookup to avoid PyArrow's hash table rebuild overhead. The logic has been cleanly refactored into several helper methods with caching for both Arrow arrays and Python lookup dictionaries. My feedback includes a suggestion to improve the caching implementation for better maintainability.

@xinyuangui2 added the `go` label (add ONLY when ready to merge, run all tests) on Jan 6, 2026
xinyuangui2 and others added 5 commits January 6, 2026 17:48
Updated the threshold for switching between Python dict and PyArrow pc.index_in from 10000 to 50.

@xinyuangui2 changed the title from "Preprocessor batch optimization" to "[Data] Cache Arrow arrays in OrdinalEncoder for batch processing" on Jan 6, 2026
@xinyuangui2 xinyuangui2 marked this pull request as ready for review January 6, 2026 20:38
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner January 6, 2026 20:38
@xinyuangui2 xinyuangui2 requested a review from gvspraveen January 6, 2026 20:38
@xinyuangui2 xinyuangui2 requested a review from raulchen January 6, 2026 21:23
@iamjustinhsu (Contributor) left a comment


lgtm!

```python
_validate_arrow(table, *self.columns)

# Check for list columns (runtime fallback for PandasBlockSchema datasets)
for col_name in self.columns:
```
Contributor:

(Follow-up) Please make private fields properly private (prefix with _)

Contributor Author:

I will have this as one follow up: https://anyscale1.atlassian.net/browse/DATA-1775

Comment on lines +52 to +58
```python
def _init_arrow_cache(self):
    """Initialize the Arrow array cache. Call this in __init__."""
    self._cache: Dict[str, Tuple[pa.Array, pa.Array]] = {}

def _clear_arrow_cache(self):
    """Clear cached Arrow arrays to ensure fresh data after re-fitting."""
    self._cache.clear()
```
Contributor:

Why do we need these 2?

Contributor Author:

To handle the case preprocessor.fit(A).fit(B).

Actually I found a bug and made a quick fix: #60031
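The `fit(A).fit(B)` case above can be demonstrated with a hypothetical skeleton (class and method names are illustrative, not the actual preprocessor code): without clearing the cache on re-fit, the second `fit` would keep serving arrays built from the first dataset's stats.

```python
class CachedEncoderSketch:
    """Hypothetical skeleton showing why re-fitting must clear the cache."""

    def __init__(self):
        self.stats_ = {}
        self._cache = {}  # column name -> materialized lookup data

    def fit(self, stats):
        self.stats_ = stats
        # Without this clear, fit(A).fit(B) would keep serving A's arrays.
        self._cache.clear()
        return self

    def _get_arrays(self, input_col):
        if input_col not in self._cache:
            # Stand-in for building pa.Array lookup structures from stats_.
            self._cache[input_col] = tuple(self.stats_[input_col])
        return self._cache[input_col]
```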

Contributor Author:

I created a ticket to follow up: https://anyscale1.atlassian.net/browse/DATA-1788

Once that is fixed, we can remove this cache.

```python
    """Clear cached Arrow arrays to ensure fresh data after re-fitting."""
    self._cache.clear()

def _get_arrow_arrays(self, input_col: str) -> Tuple[pa.Array, pa.Array]:
```
Contributor:

@xinyuangui2 my concern with the caching implementation is the following:

  • We're adding complexity (which is fine by itself but needs to be clearly justified by a performance win)
  • We're adding additional state that implementers of the Preprocessors now need to manage

Contributor:

Oh, on a second thought though, why don't we just use functools.cache on this method to remove this state from the Preprocessor itself?

Contributor Author:

  1. `functools.cache` doesn't offer cache invalidation control, which we want to hook into `fit`.
  2. The cache doesn't depend only on `input_col`; it also depends on `self.stats_` implicitly.

I updated the benchmark with TPC-H. I think the performance win is clear.
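The second point can be demonstrated with a small illustrative class (names are hypothetical; `lru_cache(maxsize=None)` is equivalent to `functools.cache`): the cache key is only `(self, input_col)`, so a later change to `self.stats_` — e.g. a re-fit — silently returns stale results.

```python
import functools

class StaleCacheDemo:
    """Illustrates why functools caching is unsafe here: the cached result
    implicitly depends on self.stats_, which re-fitting mutates."""

    def __init__(self, stats):
        self.stats_ = stats

    @functools.lru_cache(maxsize=None)  # key is (self, input_col) only
    def get_categories(self, input_col):
        return tuple(self.stats_[input_col])

demo = StaleCacheDemo({"col": ["a"]})
first = demo.get_categories("col")  # computed from stats_ and cached
demo.stats_ = {"col": ["b"]}        # simulate a re-fit
stale = demo.get_categories("col")  # cache hit: still the old categories
```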

@alexeykudinkin (Contributor) Jan 14, 2026:

Oh, you're right, we can't use functools.cache unfortunately

xinyuangui2 and others added 3 commits January 13, 2026 17:44
@xinyuangui2 changed the title from "[Data][2/N] Arrow format for OneHotEncoder and Cache Arrow arrays" to "[Data][2/N] Arrow format for OneHotEncoder" on Jan 14, 2026
@alexeykudinkin alexeykudinkin merged commit 82cf17f into ray-project:master Jan 14, 2026
6 checks passed
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
bveeramani pushed a commit that referenced this pull request Feb 25, 2026
## Description
Rename fields in preprocessors to conform to naming convention for
private fields in classes

## Related issues
Fixes variable naming issue reported in #59890

## Additional information
Backwards compatibility is maintained via `property` so that the old variable
names still provide access to the private variables.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants