Revert "[Data] Fix DataContext deserialization issue with StatsActor"#59458
Revert "[Data] Fix DataContext deserialization issue with StatsActor"#59458bveeramani merged 1 commit intomasterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request reverts the changes from PR #59387, which introduced DataContextMetadata to handle DataContext deserialization issues with StatsActor. The revert removes the DataContextMetadata class and its usage, passing the DataContext object directly to the StatsActor.
My main concern is that this re-introduces a bug where StatsActor can fail to deserialize a DataContext if it contains custom types not available in the actor's environment. This is a known issue that the original PR aimed to fix, and the corresponding test case has also been removed. While I understand this revert may be necessary to address other regressions, this side-effect should be acknowledged.
| # Convert DataContext to DataContextMetadata before serialization to avoid | ||
| # module dependency issues during Ray's cloudpickle serialization. | ||
| data_context = DataContextMetadata.from_data_context(data_context) | ||
|
|
There was a problem hiding this comment.
This revert removes the DataContextMetadata wrapper, which was used to sanitize DataContext before sending it to the StatsActor. By passing the DataContext object directly to the remote register_dataset method, this change re-introduces a potential deserialization issue.
If the DataContext contains custom types (e.g., custom exception classes in actor_task_retry_on_errors) that are not present in the StatsActor's runtime environment, the remote call will fail with a ModuleNotFoundError or similar deserialization error during unpickling on the actor side.
The removed test, test_data_context_with_custom_classes_serialization, was specifically designed to prevent this regression. While this revert might be necessary to fix another issue, it's important to be aware that this re-introduces the original bug.
|
hey @dragongu , we have to revert this PR because it's failing test_stats. |
There was a problem hiding this comment.
Bug: DataContext serialization fails with custom exception classes
This revert re-introduces a serialization bug. When DataContext is passed directly to stats_actor.register_dataset.remote(), Ray serializes it via cloudpickle. If DataContext contains custom exception classes (via actor_task_retry_on_errors), those class references get pickled. When the StatsActor (potentially running in a different job/process) deserializes the arguments, it fails with ModuleNotFoundError because those custom modules aren't available in the actor's runtime environment. The sanitize_for_struct() call at export time comes too late - the failure occurs during deserialization.
python/ray/data/_internal/stats.py#L897-L904
ray/python/ray/data/_internal/stats.py
Lines 897 to 904 in 0cebb85
|
test on postmerge for the failing test on python 3.12: https://buildkite.com/ray-project/postmerge/builds/14975 |
…ray-project#59458) Reverts ray-project#59387 Signed-off-by: kriyanshii <kriyanshishah06@gmail.com>
…ray-project#59458) Reverts ray-project#59387 Signed-off-by: peterxcli <peterxcli@gmail.com>
Reverts #59387