[Data] [Preprocessor] Fix stale stats on refit #60031

Merged
alexeykudinkin merged 22 commits into ray-project:master from xinyuangui2:fix-stale-stats-on-refit
Jan 13, 2026

@xinyuangui2
Contributor

Why are these changes needed?

When fit() is called multiple times on a Preprocessor, the stats_ dict was not being reset before computing new stats. This caused stale stats from previous fit() calls to persist when stat keys are data-dependent (e.g., when columns are auto-detected from the data).

This violates the documented behavior in the fit() docstring:

Calling it more than once will overwrite all previously fitted state:
preprocessor.fit(A).fit(B) is equivalent to preprocessor.fit(B).

Example of the bug:

# Preprocessor that auto-detects columns from data
preprocessor = DataDependentPreprocessor()

# Dataset A has columns: a, b
dataset_a = ray.data.from_items([{"a": 1.0, "b": 10.0}, ...])
preprocessor.fit(dataset_a)
# stats_ = {"mean(a)": 2.0, "mean(b)": 20.0}

# Dataset B has columns: b, c (no "a" column)
dataset_b = ray.data.from_items([{"b": 100.0, "c": 1000.0}, ...])
preprocessor.fit(dataset_b)

# BUG: stats_ = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}
#      "mean(a)" is STALE - it should not exist!
# EXPECTED: stats_ = {"mean(b)": 200.0, "mean(c)": 2000.0}
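The mechanism can be reproduced without Ray. The sketch below is a minimal, hypothetical stand-in: `ToyMeanPreprocessor` is not a Ray class, and plain lists of dicts replace `ray.data.Dataset`. It assumes the stale key survives because `fit()` writes new entries into the existing `stats_` dict instead of starting from an empty one.

```python
from statistics import mean


class ToyMeanPreprocessor:
    """Hypothetical stand-in for a preprocessor that auto-detects columns."""

    def __init__(self):
        self.stats_ = {}

    def fit(self, rows):
        # Auto-detect columns from the data, preserving first-seen order.
        columns = list(dict.fromkeys(key for row in rows for key in row))
        for col in columns:
            values = [row[col] for row in rows if col in row]
            # Buggy pattern (assumed): new stats are written into the old
            # dict, so keys from an earlier fit() are never dropped.
            self.stats_[f"mean({col})"] = mean(values)
        return self


rows_a = [{"a": 1.0, "b": 10.0}, {"a": 3.0, "b": 30.0}]
rows_b = [{"b": 100.0, "c": 1000.0}, {"b": 300.0, "c": 3000.0}]

refit = ToyMeanPreprocessor().fit(rows_a).fit(rows_b)
fresh = ToyMeanPreprocessor().fit(rows_b)

# Keys present after fit(A).fit(B) but absent after fit(B) alone are stale.
stale_keys = sorted(set(refit.stats_) - set(fresh.stats_))
print(stale_keys)
```

Here `fit(A).fit(B)` and `fit(B)` disagree only on the leftover `"mean(a)"` entry, which is the violation of the docstring contract described above.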

xinyuangui2 and others added 19 commits November 17, 2025 16:47
The GIL makes checking `self._serialize_cache is not None` atomic, so we don't need a lock.

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
When fit() is called multiple times on a Preprocessor, the stats_ dict
was not being reset before computing new stats. This caused stale stats
from previous fit() calls to persist when stat keys are data-dependent
(e.g., when columns are auto-detected from the data).

This violates the documented behavior that 'fit(A).fit(B) is equivalent
to fit(B)'.

Fix: Reset stats_ to an empty dict at the start of fit() before
computing new stats.

Added unit test to verify stale stats are properly cleared on refit.
Signed-off-by: xgui <xgui@anyscale.com>
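As a rough illustration of the fix described in this commit message, the sketch below resets `stats_` at the start of `fit()`. `ResettingMeanPreprocessor` is a hypothetical toy class operating on lists of dicts, not the actual Ray `Preprocessor` code, which is not shown on this page.

```python
from statistics import mean


class ResettingMeanPreprocessor:
    """Hypothetical toy preprocessor; not the Ray implementation."""

    def __init__(self):
        self.stats_ = {}

    def fit(self, rows):
        # Reset fitted state first so nothing from a previous fit() survives.
        self.stats_ = {}
        # Auto-detect columns from the data, preserving first-seen order.
        columns = list(dict.fromkeys(key for row in rows for key in row))
        for col in columns:
            values = [row[col] for row in rows if col in row]
            self.stats_[f"mean({col})"] = mean(values)
        return self


rows_a = [{"a": 1.0, "b": 10.0}, {"a": 3.0, "b": 30.0}]
rows_b = [{"b": 100.0, "c": 1000.0}, {"b": 300.0, "c": 3000.0}]

refit_stats = ResettingMeanPreprocessor().fit(rows_a).fit(rows_b).stats_
fresh_stats = ResettingMeanPreprocessor().fit(rows_b).stats_
print(refit_stats == fresh_stats)
```

With the reset in place, `fit(A).fit(B)` leaves the same `stats_` as `fit(B)`, regardless of which columns dataset A contained.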
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a bug where Preprocessor.fit() failed to clear stale statistics on subsequent calls, ensuring that fit(A).fit(B) is equivalent to fit(B). The fix is simple and effective, and the newly added test case test_fit_twice_clears_stale_stats thoroughly validates this behavior. I've suggested a couple of minor improvements to the test assertions to make them more concise and robust.
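The suggested assertion changes are not reproduced on this page. As one illustration of why assertion style matters for this kind of bug, the hypothetical sketch below (plain dicts, helper names invented for the example) contrasts checking only the expected keys with comparing the whole dict: only the latter notices a leftover key.

```python
# Expected state after refitting on dataset B, taken from the PR description.
expected = {"mean(b)": 200.0, "mean(c)": 2000.0}

# State produced by the buggy refit, also from the PR description.
buggy_stats = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}


def per_key_check(stats):
    """Hypothetical helper: passes if every expected key has the right value."""
    return all(stats.get(key) == value for key, value in expected.items())


def whole_dict_check(stats):
    """Hypothetical helper: passes only if there are no missing or extra keys."""
    return stats == expected


per_key_result = per_key_check(buggy_stats)
whole_dict_result = whole_dict_check(buggy_stats)
print(per_key_result, whole_dict_result)
```

The per-key check accepts the buggy state because the stale `"mean(a)"` entry is simply never inspected, while the whole-dict comparison rejects it.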

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Jan 11, 2026
xinyuangui2 and others added 3 commits January 10, 2026 17:40
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
@xinyuangui2 xinyuangui2 added the go add ONLY when ready to merge, run all tests label Jan 12, 2026
@alexeykudinkin alexeykudinkin merged commit 505b370 into ray-project:master Jan 13, 2026
7 checks passed
rushikeshadhav pushed a commit to rushikeshadhav/ray that referenced this pull request Jan 14, 2026
### Why are these changes needed?

When `fit()` is called multiple times on a `Preprocessor`, the `stats_`
dict was not being reset before computing new stats. This caused stale
stats from previous `fit()` calls to persist when stat keys are
data-dependent (e.g., when columns are auto-detected from the data).

This violates the documented behavior in the `fit()` docstring:

> Calling it more than once will overwrite all previously fitted state:
> `preprocessor.fit(A).fit(B)` is equivalent to `preprocessor.fit(B)`.

**Example of the bug:**

```python
# Preprocessor that auto-detects columns from data
preprocessor = DataDependentPreprocessor()

# Dataset A has columns: a, b
dataset_a = ray.data.from_items([{"a": 1.0, "b": 10.0}, ...])
preprocessor.fit(dataset_a)
# stats_ = {"mean(a)": 2.0, "mean(b)": 20.0}

# Dataset B has columns: b, c (no "a" column)
dataset_b = ray.data.from_items([{"b": 100.0, "c": 1000.0}, ...])
preprocessor.fit(dataset_b)

# BUG: stats_ = {"mean(a)": 2.0, "mean(b)": 200.0, "mean(c)": 2000.0}
#      "mean(a)" is STALE - it should not exist!
# EXPECTED: stats_ = {"mean(b)": 200.0, "mean(c)": 2000.0}
```

---------

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026