[Data] Make All Preprocessors Implement SerializablePreprocessorBase #61341

Merged
bveeramani merged 55 commits into ray-project:master from rayhhome:make-prepro-serializable
Mar 4, 2026
@rayhhome
Contributor

Description

The SerializablePreprocessorBase abstract class declares the methods for saving and loading preprocessor state, and every preprocessor should implement it. This PR implements those abstract methods for all preprocessors that do not yet inherit from this base class.
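
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what "implementing the abstract methods" looks like. It does not import Ray: `SerializableBaseSketch` and `ToyNormalizer` are hypothetical stand-ins, and the method signatures are assumptions that may differ from the real base class.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SerializableBaseSketch(ABC):
    """Hypothetical stand-in for a base class with save/load hooks."""

    @abstractmethod
    def _get_serializable_fields(self) -> Dict[str, Any]:
        """Return the state needed to rebuild this object."""

    @abstractmethod
    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        """Restore state previously returned by _get_serializable_fields."""


class ToyNormalizer(SerializableBaseSketch):
    """Toy preprocessor; the real classes carry more state than this."""

    def __init__(self, columns: List[str], norm: str = "l2"):
        self.columns = list(columns)
        self.norm = norm

    def _get_serializable_fields(self) -> Dict[str, Any]:
        return {"columns": self.columns, "norm": self.norm}

    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        self.columns = fields["columns"]
        self.norm = fields["norm"]


original = ToyNormalizer(["a", "b"], norm="l1")
# Rebuild without calling __init__, then restore the saved fields.
restored = ToyNormalizer.__new__(ToyNormalizer)
restored._set_serializable_fields(original._get_serializable_fields())
```

A subclass that omits either method cannot be instantiated, which is how an abstract base makes the contract hard to forget.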

Related issues

Related to #61028, which implemented backward compatibility for the legacy pickling layer. The __setstate__ functions should be removed along with the deprecated Predictor; in future iterations, _get_serializable_fields and _set_serializable_fields should be used instead to save and load preprocessor state.
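
As a rough illustration of the difference between the two paths, the sketch below contrasts a `__setstate__` hook that tolerates state from an older object layout with an explicit field-based restore. This is a generic Python pattern, not the code from #61028: the class `LegacyFriendly`, its `dtype` attribute, and the default-filling logic are all hypothetical.

```python
from typing import Any, Dict, List, Optional


class LegacyFriendly:
    """Hypothetical class that gained a `dtype` attribute in a later version."""

    def __init__(self, columns: List[str], dtype: Optional[str] = None):
        self.columns = list(columns)
        self.dtype = dtype

    # Legacy path: pickle calls __setstate__ with whatever __dict__ was saved,
    # so state written before `dtype` existed needs a default filled in.
    def __setstate__(self, state: Dict[str, Any]) -> None:
        self.__dict__.update(state)
        self.__dict__.setdefault("dtype", None)

    # Field-based path: the saved keys are listed explicitly.
    def _get_serializable_fields(self) -> Dict[str, Any]:
        return {"columns": self.columns, "dtype": self.dtype}

    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        self.columns = fields["columns"]
        self.dtype = fields["dtype"]


# Simulate state captured by an older version that had no `dtype`.
old_state = {"columns": ["a"]}
from_legacy = LegacyFriendly.__new__(LegacyFriendly)
from_legacy.__setstate__(old_state)

# The field-based round trip on a current object.
current = LegacyFriendly(["a"], dtype="float32")
from_fields = LegacyFriendly.__new__(LegacyFriendly)
from_fields._set_serializable_fields(current._get_serializable_fields())
```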

Additional information

Accompanied by new field-serialization tests for all preprocessors involved.
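
A field-serialization test of this kind can be sketched as a round-trip check. The helper `check_field_round_trip` and the class `ToyTokenizer` below are hypothetical and are not taken from the test files in this PR; the sketch pickles only the plain field dictionary, which is an assumption about what a reasonable test would cover.

```python
import pickle
from typing import Any, Dict, List


class ToyTokenizer:
    """Hypothetical preprocessor exposing the two field hooks."""

    def __init__(self, columns: List[str], lowercase: bool = True):
        self.columns = list(columns)
        self.lowercase = lowercase

    def _get_serializable_fields(self) -> Dict[str, Any]:
        return {"columns": self.columns, "lowercase": self.lowercase}

    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        self.columns = fields["columns"]
        self.lowercase = fields["lowercase"]


def check_field_round_trip(preprocessor):
    """Save fields, push them through pickle, restore, and compare."""
    fields = preprocessor._get_serializable_fields()
    payload = pickle.dumps(fields)  # the fields must survive byte serialization
    clone = type(preprocessor).__new__(type(preprocessor))
    clone._set_serializable_fields(pickle.loads(payload))
    assert clone._get_serializable_fields() == fields
    return clone


clone = check_field_round_trip(ToyTokenizer(["text"], lowercase=False))
```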

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Copilot AI review requested due to automatic review settings February 26, 2026 03:06
@rayhhome rayhhome requested a review from a team as a code owner February 26, 2026 03:06
Copilot AI left a comment

Pull request overview

This PR implements serialization support for preprocessors that were not yet inheriting from SerializablePreprocessorBase. The changes add the required abstract methods _get_serializable_fields() and _set_serializable_fields() to enable CloudPickle-based serialization for all preprocessors.

Changes:

  • Migrated 9 preprocessor classes to inherit from SerializablePreprocessorBase and implement serialization methods
  • Added comprehensive serialization tests for all migrated preprocessors
  • Updated imports and added @SerializablePreprocessor decorator with version and identifier metadata
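
To make the "version and identifier metadata" point concrete, here is a small sketch of how a class decorator can attach that metadata and register the class for later lookup. It is an assumption-laden illustration, not Ray's `@SerializablePreprocessor`: the decorator `serializable_sketch`, the registry, the attribute names, and the identifier string are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Type

# Hypothetical registry mapping a stable identifier to its class.
_REGISTRY: Dict[str, Type] = {}


def serializable_sketch(version: int, identifier: str) -> Callable[[Type], Type]:
    """Attach version/identifier metadata to a class and register it."""

    def wrap(cls: Type) -> Type:
        if identifier in _REGISTRY:
            raise ValueError(f"duplicate identifier: {identifier}")
        cls.SKETCH_VERSION = version
        cls.SKETCH_IDENTIFIER = identifier
        _REGISTRY[identifier] = cls
        return cls

    return wrap


@serializable_sketch(version=1, identifier="example.toy_hasher")
class ToyHasher:
    def __init__(self, columns: List[str], num_features: int):
        self.columns = list(columns)
        self.num_features = num_features

    def _get_serializable_fields(self) -> Dict[str, Any]:
        return {"columns": self.columns, "num_features": self.num_features}

    def _set_serializable_fields(self, fields: Dict[str, Any]) -> None:
        self.columns = fields["columns"]
        self.num_features = fields["num_features"]


def dump(preprocessor) -> Dict[str, Any]:
    """Bundle the metadata with the fields so a loader can find the class."""
    cls = type(preprocessor)
    return {
        "identifier": cls.SKETCH_IDENTIFIER,
        "version": cls.SKETCH_VERSION,
        "fields": preprocessor._get_serializable_fields(),
    }


def load(payload: Dict[str, Any]):
    """Look up the class by identifier and restore its fields."""
    cls = _REGISTRY[payload["identifier"]]
    if payload["version"] > cls.SKETCH_VERSION:
        raise ValueError("payload was written by a newer version")
    obj = cls.__new__(cls)
    obj._set_serializable_fields(payload["fields"])
    return obj


payload = dump(ToyHasher(["token"], num_features=8))
restored = load(payload)
```

Keeping the identifier separate from the Python class path means a class could be moved or renamed without invalidating saved payloads, which is one plausible motivation for such metadata.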

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| python/ray/data/preprocessors/vectorizer.py | Added serialization support for HashingVectorizer and CountVectorizer |
| python/ray/data/preprocessors/transformer.py | Added serialization support for PowerTransformer |
| python/ray/data/preprocessors/torch.py | Added serialization support for TorchVisionPreprocessor |
| python/ray/data/preprocessors/tokenizer.py | Added serialization support for Tokenizer |
| python/ray/data/preprocessors/normalizer.py | Added serialization support for Normalizer |
| python/ray/data/preprocessors/hasher.py | Added serialization support for FeatureHasher |
| python/ray/data/preprocessors/discretizer.py | Added serialization support for CustomKBinsDiscretizer and UniformKBinsDiscretizer |
| python/ray/data/preprocessors/concatenator.py | Added serialization support for Concatenator |
| python/ray/data/preprocessors/chain.py | Added serialization support for Chain preprocessor |
| python/ray/data/tests/preprocessors/test_vectorizer.py | Added serialization tests for vectorizers |
| python/ray/data/tests/preprocessors/test_transformer.py | Added serialization test for PowerTransformer |
| python/ray/data/tests/preprocessors/test_torch.py | Added serialization test for TorchVisionPreprocessor |
| python/ray/data/tests/preprocessors/test_tokenizer.py | Added serialization test for Tokenizer |
| python/ray/data/tests/preprocessors/test_normalizer.py | Added serialization test for Normalizer |
| python/ray/data/tests/preprocessors/test_hasher.py | Added serialization test for FeatureHasher |
| python/ray/data/tests/preprocessors/test_discretizer.py | Added serialization tests for discretizers |
| python/ray/data/tests/preprocessors/test_concatenator.py | Added serialization test for Concatenator |
| python/ray/data/tests/preprocessors/test_chain.py | Added serialization test for Chain |

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@ray-gardener ray-gardener bot added the community-contribution (Contributed by the community) label Feb 26, 2026
@goutamvenkat-anyscale goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Feb 27, 2026
@rayhhome rayhhome self-assigned this Feb 28, 2026
@bveeramani bveeramani merged commit b9cbe1f into ray-project:master Mar 4, 2026
6 checks passed
manhld0206 pushed a commit to manhld0206/ray that referenced this pull request Mar 5, 2026
…ay-project#61341)

## Description
The `SerializablePreprocessorBase` abstract class declares functions for
saving and loading preprocessor states and should be implemented by all
preprocessors. This PR implements the abstract methods for all
preprocessors that are not yet inheriting from this base class.

## Related issues
Related to ray-project#61028 , which implemented backwards compatibility for legacy
pickling layer. The `__setstate__` functions should be removed along
with the deprecated `Predictor`, while `_get_serializable_fields` and
`_set_serializable_fields` should be used instead for saving and loading
preprocessor states in future iterations.

## Additional information
Accompanied by new field serializing tests of all preprocessors
involved.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Mạnh Lê Đức <naruto12308@gmail.com>
bittoby pushed a commit to bittoby/ray that referenced this pull request Mar 6, 2026
…ay-project#61341)

Signed-off-by: bittoby <bittoby@users.noreply.github.com>
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
…ay-project#61341)

Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 13, 2026
…ay-project#61341)

Labels

community-contribution (Contributed by the community), data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

4 participants