[Data] Rename encoder preprocessor private fields#61028
[Data] Rename encoder preprocessor private fields#61028bveeramani merged 62 commits intoray-project:masterfrom
Conversation
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
… plan Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…reprocessor field Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request refactors several preprocessors to use private fields with a leading underscore, adding properties for backward compatibility. The changes are extensive and mostly look good. I've found a few minor issues: a typo in a setter method name in LabelEncoder, some missing return type hints in setters within scaler.py for consistency, and an inconsistency in attribute access in RobustScaler. Addressing these will improve code consistency and correctness.
There was a problem hiding this comment.
Pull request overview
This PR refactors Ray Data preprocessors to rename instance fields to underscore-prefixed private names (e.g. columns → _columns) while keeping public access via @property to preserve the external API and align with the project’s private-field naming convention.
Changes:
- Rename many preprocessor configuration fields to private (
_columns,_output_columns, etc.) and update internal references accordingly. - Add
@propertygetters/setters to preserve the previous public attribute surface (e.g.preprocessor.columns). - Update
Preprocessorto store the stat computation plan in_stat_computation_planand expose it via a property.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| python/ray/data/preprocessors/vectorizer.py | Rename vectorizer config fields to private and add compatibility properties. |
| python/ray/data/preprocessors/transformer.py | Rename transformer config fields to private and add compatibility properties. |
| python/ray/data/preprocessors/torch.py | Add properties for existing private fields to preserve public attribute access. |
| python/ray/data/preprocessors/tokenizer.py | Rename tokenizer config fields to private and add compatibility properties. |
| python/ray/data/preprocessors/scaler.py | Rename scaler config fields to private and update stat plan access. |
| python/ray/data/preprocessors/normalizer.py | Rename normalizer fields to private; add type hint for norm. |
| python/ray/data/preprocessors/imputer.py | Rename imputer fields to private and update usage/serialization fields. |
| python/ray/data/preprocessors/hasher.py | Rename hasher fields to private and add compatibility properties. |
| python/ray/data/preprocessors/encoder.py | Rename encoder fields to private and update internal usage/serialization fields. |
| python/ray/data/preprocessors/discretizer.py | Rename CustomKBinsDiscretizer fields to private and add compatibility properties. |
| python/ray/data/preprocessors/chain.py | Rename preprocessors to _preprocessors and add a property for compatibility. |
| python/ray/data/preprocessor.py | Move stat plan to _stat_computation_plan, update accessors, and adjust pickling behavior. |
| ci/lint/pydoclint-baseline.txt | Remove an entry now satisfied by added type hints. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…ty + remove preprocessor setter Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…rename-private-fields
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…rename-private-fields
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
| if self._strategy == "constant": | ||
| self._is_fittable = False | ||
|
|
||
| def __setstate__(self, state: Dict[str, Any]) -> None: |
There was a problem hiding this comment.
@rayhhome, in what context would we be dealing with an old pickled object across ray versions?
There was a problem hiding this comment.
I think it would be when the user is using older versioned pickled preprocessors (like the ray_pickle.dumps in FrameworkCheckpoint). I believe ray_pickle applies standard Python unpickling, so __setstate__ would be used for restoring objects?
There was a problem hiding this comment.
i see, but i see that
FrameworkCheckpointisn't labeled as a@PublicAPI- https://github.com/anyscale/rayturbo/blob/fada4124e58de1ac14a9a33ae303edc0108d04b8/python/ray/train/predictor.py#L94 looks like loading from a checkpoint is deprecated?
Please correct me if i'm wrong. Will probably need to loop in @xinyuangui2 for context
There was a problem hiding this comment.
Merged in the interest on incremental progress.
Re: the additional complexity from implementing __setstate__ -- I think we should remove Predictor later this year. If the pickling is only relevant to Predictor, then I think we should remove all of the preprocessor pickling logic with it. I think the logic is non-trivial to read and maintain
…61341) ## Description The `SerializablePreprocessorBase` abstract class declares functions for saving and loading preprocessor states and should be implemented by all preprocessors. This PR implements the abstract methods for all preprocessors that are not yet inheriting from this base class. ## Related issues Related to #61028 , which implemented backwards compatibility for legacy pickling layer. The `__setstate__` functions should be removed along with the deprecated `Predictor`, while `_get_serializable_fields` and `_set_serializable_fields` should be used instead for saving and loading preprocessor states in future iterations. ## Additional information Accompanied by new field serializing tests of all preprocessors involved. --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…ay-project#61341) ## Description The `SerializablePreprocessorBase` abstract class declares functions for saving and loading preprocessor states and should be implemented by all preprocessors. This PR implements the abstract methods for all preprocessors that are not yet inheriting from this base class. ## Related issues Related to ray-project#61028 , which implemented backwards compatibility for legacy pickling layer. The `__setstate__` functions should be removed along with the deprecated `Predictor`, while `_get_serializable_fields` and `_set_serializable_fields` should be used instead for saving and loading preprocessor states in future iterations. ## Additional information Accompanied by new field serializing tests of all preprocessors involved. --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com> Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
…ay-project#61341) ## Description The `SerializablePreprocessorBase` abstract class declares functions for saving and loading preprocessor states and should be implemented by all preprocessors. This PR implements the abstract methods for all preprocessors that are not yet inheriting from this base class. ## Related issues Related to ray-project#61028 , which implemented backwards compatibility for legacy pickling layer. The `__setstate__` functions should be removed along with the deprecated `Predictor`, while `_get_serializable_fields` and `_set_serializable_fields` should be used instead for saving and loading preprocessor states in future iterations. ## Additional information Accompanied by new field serializing tests of all preprocessors involved. --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com> Signed-off-by: bittoby <bittoby@users.noreply.github.com>
…ay-project#61341) ## Description The `SerializablePreprocessorBase` abstract class declares functions for saving and loading preprocessor states and should be implemented by all preprocessors. This PR implements the abstract methods for all preprocessors that are not yet inheriting from this base class. ## Related issues Related to ray-project#61028 , which implemented backwards compatibility for legacy pickling layer. The `__setstate__` functions should be removed along with the deprecated `Predictor`, while `_get_serializable_fields` and `_set_serializable_fields` should be used instead for saving and loading preprocessor states in future iterations. ## Additional information Accompanied by new field serializing tests of all preprocessors involved. --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com> Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
…ay-project#61341) ## Description The `SerializablePreprocessorBase` abstract class declares functions for saving and loading preprocessor states and should be implemented by all preprocessors. This PR implements the abstract methods for all preprocessors that are not yet inheriting from this base class. ## Related issues Related to ray-project#61028 , which implemented backwards compatibility for legacy pickling layer. The `__setstate__` functions should be removed along with the deprecated `Predictor`, while `_get_serializable_fields` and `_set_serializable_fields` should be used instead for saving and loading preprocessor states in future iterations. ## Additional information Accompanied by new field serializing tests of all preprocessors involved. --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Description
Rename fields in preprocessors to conform to naming convention for private fields in classes
Related issues
Fixes variable naming issue reported in #59890
Additional information
backwards compatibility with
propertyso that old variable names still provide access to the private variables