Skip to content

Revert "[Data] Fix DataContext deserialization issue with StatsActor"#59458

Merged
bveeramani merged 1 commit intomasterfrom
revert-59387-fix/stats_actor
Dec 16, 2025
Merged

Revert "[Data] Fix DataContext deserialization issue with StatsActor"#59458
bveeramani merged 1 commit intomasterfrom
revert-59387-fix/stats_actor

Conversation

@raulchen
Copy link
Copy Markdown
Contributor

Reverts #59387

@raulchen raulchen requested a review from a team as a code owner December 16, 2025 00:31
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request reverts the changes from PR #59387, which introduced DataContextMetadata to handle DataContext deserialization issues with StatsActor. The revert removes the DataContextMetadata class and its usage, passing the DataContext object directly to the StatsActor.

My main concern is that this re-introduces a bug where StatsActor can fail to deserialize a DataContext if it contains custom types not available in the actor's environment. This is a known issue that the original PR aimed to fix, and the corresponding test case has also been removed. While I understand this revert may be necessary to address other regressions, this side-effect should be acknowledged.

# Convert DataContext to DataContextMetadata before serialization to avoid
# module dependency issues during Ray's cloudpickle serialization.
data_context = DataContextMetadata.from_data_context(data_context)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This revert removes the DataContextMetadata wrapper, which was used to sanitize DataContext before sending it to the StatsActor. By passing the DataContext object directly to the remote register_dataset method, this change re-introduces a potential deserialization issue.

If the DataContext contains custom types (e.g., custom exception classes in actor_task_retry_on_errors) that are not present in the StatsActor's runtime environment, the remote call will fail with a ModuleNotFoundError or similar deserialization error during unpickling on the actor side.

The removed test, test_data_context_with_custom_classes_serialization, was specifically designed to prevent this regression. While this revert might be necessary to fix another issue, it's important to be aware that this re-introduces the original bug.

@raulchen
Copy link
Copy Markdown
Contributor Author

hey @dragongu , we have to revert this PR because it's failing test_stats.
Please take a look at the failure and resubmit the PR when you have a chance. thanks.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: DataContext serialization fails with custom exception classes

This revert re-introduces a serialization bug. When DataContext is passed directly to stats_actor.register_dataset.remote(), Ray serializes it via cloudpickle. If DataContext contains custom exception classes (via actor_task_retry_on_errors), those class references get pickled. When the StatsActor (potentially running in a different job/process) deserializes the arguments, it fails with ModuleNotFoundError because those custom modules aren't available in the actor's runtime environment. The sanitize_for_struct() call at export time comes too late - the failure occurs during deserialization.

python/ray/data/_internal/stats.py#L897-L904

get_or_create_stats_actor().register_dataset.remote(
ray.get_runtime_context().get_job_id(),
dataset_tag,
operator_tags,
topology,
data_context,
)

Fix in Cursor Fix in Web


@bveeramani bveeramani enabled auto-merge (squash) December 16, 2025 00:38
@github-actions github-actions Bot added the go add ONLY when ready to merge, run all tests label Dec 16, 2025
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Dec 16, 2025
@aslonnie
Copy link
Copy Markdown
Collaborator

test on postmerge for the failing test on python 3.12: https://buildkite.com/ray-project/postmerge/builds/14975

@bveeramani bveeramani merged commit 694e6fd into master Dec 16, 2025
5 of 7 checks passed
@bveeramani bveeramani deleted the revert-59387-fix/stats_actor branch December 16, 2025 01:57
kriyanshii pushed a commit to kriyanshii/ray that referenced this pull request Dec 16, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants