
[Data][2/N] Arrow format for OneHotEncoder#59890

Merged
alexeykudinkin merged 38 commits into ray-project:master from xinyuangui2:preprocessor-batch-optimization
Jan 14, 2026

Conversation

@xinyuangui2 (Contributor) commented Jan 6, 2026

Support arrow format in OneHotEncoder.

Benchmark: TPC-H SF10

The improvement for OneHotEncoder:

Before:

| Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|:---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 374 | 2.67 | 2.66 | 2.86 | 2.93 | 2.51 | 3.33 |
| 5 | 1,867 | 2.68 | 2.67 | 2.82 | 2.85 | 2.52 | 3.48 |
| 10 | 3,590 | 2.79 | 2.77 | 2.91 | 3.33 | 2.61 | 3.47 |
| 20 | 6,911 | 2.89 | 2.88 | 3.08 | 3.41 | 2.72 | 3.62 |
| 50 | 15,841 | 3.16 | 3.14 | 3.28 | 3.58 | 3.00 | 3.76 |
| 100 | 27,019 | 3.70 | 3.60 | 3.86 | 4.66 | 3.40 | 9.73 |

After:

| Batch Size | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|:---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1,187 | 0.84 | 0.82 | 0.92 | 1.21 | 0.80 | 1.32 |
| 5 | 5,888 | 0.85 | 0.82 | 0.99 | 1.13 | 0.80 | 1.17 |
| 10 | 12,144 | 0.82 | 0.81 | 0.86 | 0.94 | 0.79 | 1.19 |
| 20 | 24,385 | 0.82 | 0.81 | 0.86 | 0.88 | 0.79 | 1.17 |
| 50 | 59,987 | 0.83 | 0.82 | 0.87 | 0.91 | 0.81 | 1.18 |
| 100 | 118,853 | 0.84 | 0.83 | 0.88 | 0.97 | 0.81 | 1.18 |

Null behavior

We keep the null behavior the same as the old pandas implementation.

| Encoder | Path | Null Input Behavior | Unseen Category Behavior |
|---|---|---|---|
| OrdinalEncoder | Pandas | **ValueError** | NaN |
| OrdinalEncoder | Arrow | **ValueError** | null (NaN when converted) |
| OneHotEncoder | Pandas | **ValueError** | all-zeros vector |
| OneHotEncoder | Arrow | **ValueError** | all-zeros vector |


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a performance optimization for the OrdinalEncoder by implementing an adaptive strategy for transformations on Arrow tables. For large batches, it uses the vectorized pyarrow.compute.index_in, and for smaller batches, it falls back to a faster Python dictionary lookup to avoid PyArrow's hash table rebuild overhead. The logic has been cleanly refactored into several helper methods with caching for both Arrow arrays and Python lookup dictionaries. My feedback includes a suggestion to improve the caching implementation for better maintainability.

@xinyuangui2 added the `go` label (add ONLY when ready to merge, run all tests) on Jan 6, 2026
xinyuangui2 and others added 5 commits January 6, 2026 17:48
Updated the threshold for switching between Python dict and PyArrow pc.index_in from 10000 to 50.

@xinyuangui2 changed the title from "Preprocessor batch optimization" to "[Data] Cache Arrow arrays in OrdinalEncoder for batch processing" on Jan 6, 2026
@xinyuangui2 xinyuangui2 marked this pull request as ready for review January 6, 2026 20:38
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner January 6, 2026 20:38
@xinyuangui2 xinyuangui2 requested a review from gvspraveen January 6, 2026 20:38
@xinyuangui2 xinyuangui2 requested a review from raulchen January 6, 2026 21:23
@iamjustinhsu (Contributor) left a comment


lgtm!

```python
_validate_arrow(table, *self.columns)

# Check for list columns (runtime fallback for PandasBlockSchema datasets)
for col_name in self.columns:
```
Contributor:

(Follow-up) Please make private fields properly private (prefix with _)

Contributor Author:

I will have this as one follow up: https://anyscale1.atlassian.net/browse/DATA-1775

Comment on lines +52 to +58
```python
def _init_arrow_cache(self):
    """Initialize the Arrow array cache. Call this in __init__."""
    self._cache: Dict[str, Tuple[pa.Array, pa.Array]] = {}

def _clear_arrow_cache(self):
    """Clear cached Arrow arrays to ensure fresh data after re-fitting."""
    self._cache.clear()
```
Contributor:

Why do we need these 2?

Contributor Author:

To handle the case preprocessor.fit(A).fit(B).

Actually I found a bug and made a quick fix: #60031
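The `fit(A).fit(B)` case above can be demonstrated with a hypothetical skeleton (class and method names are illustrative, not the actual preprocessor code): without clearing the cache on re-fit, the second `fit` would keep serving arrays built from the first dataset's stats.

```python
class CachedEncoderSketch:
    """Hypothetical skeleton showing why re-fitting must clear the cache."""

    def __init__(self):
        self.stats_ = {}
        self._cache = {}  # column name -> materialized lookup data

    def fit(self, stats):
        self.stats_ = stats
        # Without this clear, fit(A).fit(B) would keep serving A's arrays.
        self._cache.clear()
        return self

    def _get_arrays(self, input_col):
        if input_col not in self._cache:
            # Stand-in for building pa.Array lookup structures from stats_.
            self._cache[input_col] = tuple(self.stats_[input_col])
        return self._cache[input_col]
```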

Contributor Author:

I created a ticket to follow up: https://anyscale1.atlassian.net/browse/DATA-1788

Once that is fixed, we can remove this cache.

```python
    """Clear cached Arrow arrays to ensure fresh data after re-fitting."""
    self._cache.clear()

def _get_arrow_arrays(self, input_col: str) -> Tuple[pa.Array, pa.Array]:
```
Contributor:

@xinyuangui2 my concern with the caching implementation is the following:

  • We're adding complexity (which is fine by itself but needs to be clearly justified by a performance win)
  • We're adding additional state that implementers of the Preprocessors now need to manage

Contributor:

Oh, on a second thought though, why don't we just use functools.cache on this method to remove this state from the Preprocessor itself?

Contributor Author:

  1. `functools.cache` doesn't offer cache invalidation control, which we want to hook into `fit`.
  2. The cache doesn't depend only on `input_col`; it also depends on `self.stats_` implicitly.

I updated the benchmark with TPC-H. I think the performance win is clear.
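The second point can be demonstrated with a small illustrative class (names are hypothetical; `lru_cache(maxsize=None)` is equivalent to `functools.cache`): the cache key is only `(self, input_col)`, so a later change to `self.stats_` — e.g. a re-fit — silently returns stale results.

```python
import functools

class StaleCacheDemo:
    """Illustrates why functools caching is unsafe here: the cached result
    implicitly depends on self.stats_, which re-fitting mutates."""

    def __init__(self, stats):
        self.stats_ = stats

    @functools.lru_cache(maxsize=None)  # key is (self, input_col) only
    def get_categories(self, input_col):
        return tuple(self.stats_[input_col])

demo = StaleCacheDemo({"col": ["a"]})
first = demo.get_categories("col")  # computed from stats_ and cached
demo.stats_ = {"col": ["b"]}        # simulate a re-fit
stale = demo.get_categories("col")  # cache hit: still the old categories
```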

@alexeykudinkin (Contributor) Jan 14, 2026:

Oh, you're right, we can't use functools.cache unfortunately

xinyuangui2 and others added 3 commits January 13, 2026 17:44
@xinyuangui2 changed the title from "[Data][2/N] Arrow format for OneHotEncoder and Cache Arrow arrays" to "[Data][2/N] Arrow format for OneHotEncoder" on Jan 14, 2026
@alexeykudinkin alexeykudinkin merged commit 82cf17f into ray-project:master Jan 14, 2026
6 checks passed
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
bveeramani pushed a commit that referenced this pull request Feb 25, 2026
## Description
Rename fields in preprocessors to conform to naming convention for
private fields in classes

## Related issues
Fixes variable naming issue reported in #59890

## Additional information
Backwards compatibility is maintained via `property` so that the old variable
names still provide access to the private variables.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants