[Data] [Preprocessor] Fix stale stats on refit by xinyuangui2 · Pull Request #60031 · ray-project/ray

xinyuangui2 · 2026-01-10T23:40:02Z

Why are these changes needed?

When fit() is called multiple times on a Preprocessor, the stats_ dict was not being reset before computing new stats. This caused stale stats from previous fit() calls to persist when stat keys are data-dependent (e.g., when columns are auto-detected from the data).

This violates the documented behavior in the fit() docstring:

Calling it more than once will overwrite all previously fitted state:
preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B).

Example of the bug:

# Preprocessor that auto-detects columns from data
preprocessor = DataDependentPreprocessor()

# Dataset A has columns: a, b
dataset_a = ray.data.from_items([{"a": 1.0, "b": 10.0}, ...])
preprocessor.fit(dataset_a)
# stats_ = {"mean(a)": 2.0, "mean(b)": 20.0}

# Dataset B has columns: b, c (no "a" column)
dataset_b = ray.data.from_items([{"b": 100.0, "c": 1000.0}, ...])
preprocessor.fit(dataset_b)

# BUG: stats_ = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}
#      "mean(a)" is STALE - it should not exist!
# EXPECTED: stats_ = {"mean(b)": 200.0, "mean(c)": 2000.0}

The GIL makes checking `self._serialize_cache is not None` atomic, so we don't need a lock. Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
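The lock-free check described in this commit message can be sketched roughly as follows. This is a minimal illustration assuming CPython with the GIL enabled; only the `_serialize_cache` attribute name comes from the commit message, while the `CachedSerializer` class, its `_lock`, the `serialize_calls` counter, and the use of `pickle` are hypothetical and not taken from Ray's code.

```python
import pickle
import threading
from typing import Any, Optional


class CachedSerializer:
    """Hypothetical holder that serializes a payload once and reuses the bytes."""

    def __init__(self, payload: Any) -> None:
        self._payload = payload
        self._lock = threading.Lock()
        self._serialize_cache: Optional[bytes] = None
        self.serialize_calls = 0  # counts how often the slow path really ran

    def serialize(self) -> bytes:
        # Fast path: reading a single attribute is atomic under the GIL,
        # so a populated cache can be returned without taking the lock.
        cached = self._serialize_cache
        if cached is not None:
            return cached
        with self._lock:
            # Re-check inside the lock: another thread may have filled
            # the cache while this one was waiting.
            if self._serialize_cache is None:
                self.serialize_calls += 1
                self._serialize_cache = pickle.dumps(self._payload)
            return self._serialize_cache


serializer = CachedSerializer({"x": 1})
first = serializer.serialize()
second = serializer.serialize()  # served from the cache, no lock taken
```

The lock is still needed on the slow path so that two threads do not both serialize the payload; only the already-cached case skips it.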

When fit() is called multiple times on a Preprocessor, the stats_ dict was not being reset before computing new stats. This caused stale stats from previous fit() calls to persist when stat keys are data-dependent (e.g., when columns are auto-detected from the data). This violates the documented behavior that 'fit(A).fit(B) is equivalent to fit(B)'.

Fix: Reset stats_ to an empty dict at the start of fit() before computing new stats.

Added unit test to verify stale stats are properly cleared on refit.

Signed-off-by: xgui <xgui@anyscale.com>
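The effect of resetting stats_ at the start of fit() can be reproduced without Ray. The sketch below is only an illustration under stated assumptions: `ToyMeanPreprocessor` and its `reset_on_fit` flag are hypothetical stand-ins, not Ray's `Preprocessor` or the actual diff, and plain lists of dicts replace Ray datasets.

```python
from typing import Dict, List


class ToyMeanPreprocessor:
    """Hypothetical stand-in for a data-dependent preprocessor (not Ray's class)."""

    def __init__(self, reset_on_fit: bool = True) -> None:
        self.reset_on_fit = reset_on_fit
        self.stats_: Dict[str, float] = {}

    def fit(self, rows: List[Dict[str, float]]) -> "ToyMeanPreprocessor":
        if self.reset_on_fit:
            # The fix: drop everything learned by earlier fit() calls.
            self.stats_ = {}
        # Columns are auto-detected from the data, so stat keys are data-dependent.
        columns = sorted({name for row in rows for name in row})
        for name in columns:
            values = [row[name] for row in rows if name in row]
            self.stats_[f"mean({name})"] = sum(values) / len(values)
        return self


rows_a = [{"a": 1.0, "b": 10.0}, {"a": 3.0, "b": 30.0}]
rows_b = [{"b": 100.0, "c": 1000.0}, {"b": 300.0, "c": 3000.0}]

buggy = ToyMeanPreprocessor(reset_on_fit=False).fit(rows_a).fit(rows_b)
fixed = ToyMeanPreprocessor(reset_on_fit=True).fit(rows_a).fit(rows_b)

print(sorted(buggy.stats_))  # ['mean(a)', 'mean(b)', 'mean(c)']
print(sorted(fixed.stats_))  # ['mean(b)', 'mean(c)']
```

Without the reset, the keys written by the second fit() overwrite matching keys but never remove "mean(a)", which is exactly the stale entry shown in the PR description.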

gemini-code-assist

Code Review

This pull request correctly fixes a bug where Preprocessor.fit() failed to clear stale statistics on subsequent calls, ensuring that fit(A).fit(B) is equivalent to fit(B). The fix is simple and effective, and the newly added test case test_fit_twice_clears_stale_stats thoroughly validates this behavior. I've suggested a couple of minor improvements to the test assertions to make them more concise and robust.
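The property under review, that fit(A).fit(B) is equivalent to fit(B), can be checked with a test along these lines. This is a sketch only: the body of test_fit_twice_clears_stale_stats is not shown on this page, so `ToyMaxPreprocessor` and `check_refit_matches_fresh_fit` are hypothetical, and comparing whole dicts is an assumption about what a concise assertion might look like rather than the reviewer's actual suggestion.

```python
from typing import Dict, List

Rows = List[Dict[str, float]]


class ToyMaxPreprocessor:
    """Hypothetical preprocessor whose stat keys depend on the fitted data."""

    def __init__(self) -> None:
        self.stats_: Dict[str, float] = {}

    def fit(self, rows: Rows) -> "ToyMaxPreprocessor":
        self.stats_ = {}  # reset fitted state before recomputing
        for row in rows:
            for name, value in row.items():
                key = f"max({name})"
                self.stats_[key] = max(value, self.stats_.get(key, value))
        return self


def check_refit_matches_fresh_fit(rows_a: Rows, rows_b: Rows) -> None:
    first_keys = set(ToyMaxPreprocessor().fit(rows_a).stats_)
    refit = ToyMaxPreprocessor().fit(rows_a).fit(rows_b)
    fresh = ToyMaxPreprocessor().fit(rows_b)
    # Comparing whole dicts catches stale keys and wrong values in one assertion.
    assert refit.stats_ == fresh.stats_
    # Keys that only the first dataset produced must be gone after the refit.
    stale_keys = first_keys - set(fresh.stats_)
    assert not stale_keys & set(refit.stats_)


check_refit_matches_fresh_fit(
    rows_a=[{"a": 1.0, "b": 10.0}],
    rows_b=[{"b": 100.0, "c": 1000.0}],
)
```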

python/ray/data/tests/preprocessors/test_preprocessors.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

xinyuangui2 and others added 19 commits November 17, 2025 16:47

Avoid lock if serialization result is cached

de4f17f

The GIL makes checking `self._serialize_cache is not None` atomic, so we don't need a lock. Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

Merge branch 'ray-project:master' into master

003b4ab

Merge branch 'ray-project:master' into master

93ab9d2

Merge branch 'ray-project:master' into master

e2cd6b8

Merge branch 'ray-project:master' into master

136ec12

Merge branch 'ray-project:master' into master

dc4258f

Merge branch 'ray-project:master' into master

80f2246

Merge branch 'ray-project:master' into master

52fb570

Merge branch 'ray-project:master' into master

3c42af4

Merge branch 'ray-project:master' into master

87333fe

Merge branch 'ray-project:master' into master

c4272db

Merge branch 'ray-project:master' into master

b6204ac

Merge branch 'ray-project:master' into master

da4e3c9

Merge branch 'ray-project:master' into master

94c1f59

Merge branch 'ray-project:master' into master

21268d3

Merge branch 'ray-project:master' into master

c756582

Merge branch 'ray-project:master' into master

515b058

put import to head

3e826ca

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from a team as a code owner January 10, 2026 23:40

xinyuangui2 requested a review from alexeykudinkin January 10, 2026 23:40

xinyuangui2 mentioned this pull request Jan 10, 2026

[Data][2/N] Arrow format for OneHotEncoder #59890

Merged

gemini-code-assist bot reviewed Jan 10, 2026

ray-gardener bot added the "data" label (Ray Data-related issues) Jan 11, 2026

xinyuangui2 and others added 3 commits January 10, 2026 17:40

Update python/ray/data/tests/preprocessors/test_preprocessors.py

7b00408

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

Update python/ray/data/tests/preprocessors/test_preprocessors.py

bbfaec4

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

Merge branch 'master' into fix-stale-stats-on-refit

6230a2c

xinyuangui2 added the "go" label (add ONLY when ready to merge, run all tests) Jan 12, 2026

alexeykudinkin approved these changes Jan 13, 2026

alexeykudinkin merged commit 505b370 into ray-project:master Jan 13, 2026
7 checks passed

alexeykudinkin merged 22 commits into ray-project:master from xinyuangui2:fix-stale-stats-on-refit
