Skip to content

[Data] Rename encoder preprocessor private fields#61028

Merged
bveeramani merged 62 commits intoray-project:masterfrom
rayhhome:rename-private-fields
Feb 25, 2026
Merged

[Data] Rename encoder preprocessor private fields#61028
bveeramani merged 62 commits intoray-project:masterfrom
rayhhome:rename-private-fields

Conversation

@rayhhome
Copy link
Copy Markdown
Contributor

Description

Rename fields in preprocessors to conform to naming convention for private fields in classes

Related issues

Fixes variable naming issue reported in #59890

Additional information

backwards compatibility with property so that old variable names still provide access to the private variables

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
… plan

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…reprocessor field

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Copilot AI review requested due to automatic review settings February 12, 2026 20:13
@rayhhome rayhhome requested a review from a team as a code owner February 12, 2026 20:13
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors several preprocessors to use private fields with a leading underscore, adding properties for backward compatibility. The changes are extensive and mostly look good. I've found a few minor issues: a typo in a setter method name in LabelEncoder, some missing return type hints in setters within scaler.py for consistency, and an inconsistency in attribute access in RobustScaler. Addressing these will improve code consistency and correctness.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Ray Data preprocessors to rename instance fields to underscore-prefixed private names (e.g. columns_columns) while keeping public access via @property to preserve the external API and align with the project’s private-field naming convention.

Changes:

  • Rename many preprocessor configuration fields to private (_columns, _output_columns, etc.) and update internal references accordingly.
  • Add @property getters/setters to preserve the previous public attribute surface (e.g. preprocessor.columns).
  • Update Preprocessor to store the stat computation plan in _stat_computation_plan and expose it via a property.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
python/ray/data/preprocessors/vectorizer.py Rename vectorizer config fields to private and add compatibility properties.
python/ray/data/preprocessors/transformer.py Rename transformer config fields to private and add compatibility properties.
python/ray/data/preprocessors/torch.py Add properties for existing private fields to preserve public attribute access.
python/ray/data/preprocessors/tokenizer.py Rename tokenizer config fields to private and add compatibility properties.
python/ray/data/preprocessors/scaler.py Rename scaler config fields to private and update stat plan access.
python/ray/data/preprocessors/normalizer.py Rename normalizer fields to private; add type hint for norm.
python/ray/data/preprocessors/imputer.py Rename imputer fields to private and update usage/serialization fields.
python/ray/data/preprocessors/hasher.py Rename hasher fields to private and add compatibility properties.
python/ray/data/preprocessors/encoder.py Rename encoder fields to private and update internal usage/serialization fields.
python/ray/data/preprocessors/discretizer.py Rename CustomKBinsDiscretizer fields to private and add compatibility properties.
python/ray/data/preprocessors/chain.py Rename preprocessors to _preprocessors and add a property for compatibility.
python/ray/data/preprocessor.py Move stat plan to _stat_computation_plan, update accessors, and adjust pickling behavior.
ci/lint/pydoclint-baseline.txt Remove an entry now satisfied by added type hints.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…ty + remove preprocessor setter

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Feb 20, 2026
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

if self._strategy == "constant":
self._is_fittable = False

def __setstate__(self, state: Dict[str, Any]) -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rayhhome, in what context would we be dealing with an old pickled object across ray versions?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be when the user is using older versioned pickled preprocessors (like the ray_pickle.dumps in FrameworkCheckpoint). I believe ray_pickle applies standard Python unpickling, so __setstate__ would be used for restoring objects?

Copy link
Copy Markdown
Contributor

@iamjustinhsu iamjustinhsu Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, but i see that

Please correct me if i'm wrong. Will probably need to loop in @xinyuangui2 for context

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged in the interest on incremental progress.

Re: the additional complexity from implementing __setstate__ -- I think we should remove Predictor later this year. If the pickling is only relevant to Predictor, then I think we should remove all of the preprocessor pickling logic with it. I think the logic is non-trivial to read and maintain

@bveeramani bveeramani merged commit 09c7c76 into ray-project:master Feb 25, 2026
5 checks passed
bveeramani pushed a commit that referenced this pull request Mar 4, 2026
…61341)

## Description
The `SerializablePreprocessorBase` abstract class declares functions for
saving and loading preprocessor states and should be implemented by all
preprocessors. This PR implements the abstract methods for all
preprocessors that are not yet inheriting from this base class.

## Related issues
Related to #61028 , which implemented backwards compatibility for legacy
pickling layer. The `__setstate__` functions should be removed along
with the deprecated `Predictor`, while `_get_serializable_fields` and
`_set_serializable_fields` should be used instead for saving and loading
preprocessor states in future iterations.

## Additional information
Accompanied by new field serializing tests of all preprocessors
involved.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
manhld0206 pushed a commit to manhld0206/ray that referenced this pull request Mar 5, 2026
…ay-project#61341)

## Description
The `SerializablePreprocessorBase` abstract class declares functions for
saving and loading preprocessor states and should be implemented by all
preprocessors. This PR implements the abstract methods for all
preprocessors that are not yet inheriting from this base class.

## Related issues
Related to ray-project#61028 , which implemented backwards compatibility for legacy
pickling layer. The `__setstate__` functions should be removed along
with the deprecated `Predictor`, while `_get_serializable_fields` and
`_set_serializable_fields` should be used instead for saving and loading
preprocessor states in future iterations.

## Additional information
Accompanied by new field serializing tests of all preprocessors
involved.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
bittoby pushed a commit to bittoby/ray that referenced this pull request Mar 6, 2026
…ay-project#61341)

## Description
The `SerializablePreprocessorBase` abstract class declares functions for
saving and loading preprocessor states and should be implemented by all
preprocessors. This PR implements the abstract methods for all
preprocessors that are not yet inheriting from this base class.

## Related issues
Related to ray-project#61028 , which implemented backwards compatibility for legacy
pickling layer. The `__setstate__` functions should be removed along
with the deprecated `Predictor`, while `_get_serializable_fields` and
`_set_serializable_fields` should be used instead for saving and loading
preprocessor states in future iterations.

## Additional information
Accompanied by new field serializing tests of all preprocessors
involved.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: bittoby <bittoby@users.noreply.github.com>
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
…ay-project#61341)

## Description
The `SerializablePreprocessorBase` abstract class declares functions for
saving and loading preprocessor states and should be implemented by all
preprocessors. This PR implements the abstract methods for all
preprocessors that are not yet inheriting from this base class.

## Related issues
Related to ray-project#61028 , which implemented backwards compatibility for legacy
pickling layer. The `__setstate__` functions should be removed along
with the deprecated `Predictor`, while `_get_serializable_fields` and
`_set_serializable_fields` should be used instead for saving and loading
preprocessor states in future iterations.

## Additional information
Accompanied by new field serializing tests of all preprocessors
involved.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 13, 2026
…ay-project#61341)

## Description
The `SerializablePreprocessorBase` abstract class declares functions for
saving and loading preprocessor states and should be implemented by all
preprocessors. This PR implements the abstract methods for all
preprocessors that are not yet inheriting from this base class.

## Related issues
Related to ray-project#61028 , which implemented backwards compatibility for legacy
pickling layer. The `__setstate__` functions should be removed along
with the deprecated `Predictor`, while `_get_serializable_fields` and
`_set_serializable_fields` should be used instead for saving and loading
preprocessor states in future iterations.

## Additional information
Accompanied by new field serializing tests of all preprocessors
involved.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants