Skip to content

Add DataContext + LogicalOp Args to Dataset Export#53554

Merged
raulchen merged 15 commits intoray-project:masterfrom
iamjustinhsu:jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter
Jun 9, 2025
Merged

Add DataContext + LogicalOp Args to Dataset Export#53554
raulchen merged 15 commits intoray-project:masterfrom
iamjustinhsu:jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter

Conversation

@iamjustinhsu
Copy link
Copy Markdown
Contributor

@iamjustinhsu iamjustinhsu commented Jun 4, 2025

Why are these changes needed?

The motivation is that many times we need to look at user code to understand their usage when debugging issues.
so I'd like to make sure we capture all the info needed in the metadata exporter

event_EXPORT_DATASET_METADATA.log
^ Here is an example file.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 87694b0 to 6a6455f Compare June 4, 2025 15:57
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 6a6455f to 8fb69eb Compare June 4, 2025 15:57
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from eac92c1 to 384d5fb Compare June 4, 2025 16:16
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 68610f1 to fb7131a Compare June 5, 2025 18:21
This reverts commit fb7131a.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
This reverts commit 384d5fb.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 85999e9 to 0b4b974 Compare June 5, 2025 20:22
…u/add-ctx-and-logical-ops-to-dataset-metadata-exporter
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 20d1bf5 to c12a2c9 Compare June 5, 2025 22:26
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 7db0d59 to ec730b9 Compare June 6, 2025 19:10
This reverts commit c12a2c9.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ted]""

This reverts commit 0b4b974.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
This reverts commit 64ab042.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch 2 times, most recently from a9bbd8e to df0af64 Compare June 6, 2025 19:12
Comment on lines 149 to 156
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where we catch -all

@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from df0af64 to 778c8f4 Compare June 6, 2025 19:28
@iamjustinhsu iamjustinhsu marked this pull request as ready for review June 6, 2025 19:30
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner June 6, 2025 19:30
@iamjustinhsu iamjustinhsu requested a review from raulchen June 6, 2025 19:56
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 778c8f4 to 89549c8 Compare June 6, 2025 20:02
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 89549c8 to 5f6bbed Compare June 6, 2025 20:02

def _get_args(self) -> Dict[str, Any]:
"""This Dict must be serializable"""
return {k: sanitize_for_struct(v) for k, v in vars(self).items()}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously sanitize_for_struct was called on the metadata exporter.
And that looks cleaner than putting it here.
Was there any issues?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the issue was sometimes the args are not pickable, so i was encountering this error

 !!! FAIL serialization: cannot pickle '_thread.lock' object

This is because we call register_dataset_for_stats_actor, which rpc serializes and fails. honestly i could case handle this, but didn't want to risk

data_context: DataContext


def sanitize_for_struct(obj):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually should we just json.dumps the entire obj?
this function seems to be just reinventing json.dumps.

Copy link
Copy Markdown
Contributor Author

@iamjustinhsu iamjustinhsu Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds smart, im gonna try that Oh wait, the reason why is because if there is a deeply nested json, and one argument is not serialzable, then it fails the entire thing. The sanitize_for_struct gives a best guess estimate of the obj itself. lmk if that doesn't make sense

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

Copy link
Copy Markdown
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. a few final comments.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch from 80ffca0 to 2e29c7a Compare June 6, 2025 21:12
@iamjustinhsu iamjustinhsu added the go add ONLY when ready to merge, run all tests label Jun 6, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@raulchen raulchen merged commit b04c151 into ray-project:master Jun 9, 2025
5 checks passed
@iamjustinhsu iamjustinhsu deleted the jhsu/add-ctx-and-logical-ops-to-dataset-metadata-exporter branch June 9, 2025 18:00
bveeramani pushed a commit that referenced this pull request Nov 20, 2025
… op args when refreshing metadata (#58755)

## Description
We export dataset and operator metadata whenever there is a state
change. However, the size of the export file can be very large because
the metadata also includes the DataContext config and operator args,
which does not change over time, and they will be written multiple times
to the file. To reduce the file size and remove redundant info, we can
only export DataContext and operator args when the dataset is
[registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621),
and avoid them in the later state updates.

## Related issues
Related previous PRs:
[55355](#55355),
[53554](#53554)

Signed-off-by: cong.qian <cong.qian@anyscale.com>
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
… op args when refreshing metadata (ray-project#58755)

## Description
We export dataset and operator metadata whenever there is a state
change. However, the size of the export file can be very large because
the metadata also includes the DataContext config and operator args,
which does not change over time, and they will be written multiple times
to the file. To reduce the file size and remove redundant info, we can
only export DataContext and operator args when the dataset is
[registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621),
and avoid them in the later state updates.

## Related issues
Related previous PRs:
[55355](ray-project#55355),
[53554](ray-project#53554)

Signed-off-by: cong.qian <cong.qian@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
… op args when refreshing metadata (ray-project#58755)

## Description
We export dataset and operator metadata whenever there is a state
change. However, the size of the export file can be very large because
the metadata also includes the DataContext config and operator args,
which does not change over time, and they will be written multiple times
to the file. To reduce the file size and remove redundant info, we can
only export DataContext and operator args when the dataset is
[registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621),
and avoid them in the later state updates.

## Related issues
Related previous PRs:
[55355](ray-project#55355),
[53554](ray-project#53554)

Signed-off-by: cong.qian <cong.qian@anyscale.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
… op args when refreshing metadata (ray-project#58755)

## Description
We export dataset and operator metadata whenever there is a state
change. However, the size of the export file can be very large because
the metadata also includes the DataContext config and operator args,
which does not change over time, and they will be written multiple times
to the file. To reduce the file size and remove redundant info, we can
only export DataContext and operator args when the dataset is
[registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621),
and avoid them in the later state updates.

## Related issues
Related previous PRs:
[55355](ray-project#55355),
[53554](ray-project#53554)

Signed-off-by: cong.qian <cong.qian@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
… op args when refreshing metadata (ray-project#58755)

## Description
We export dataset and operator metadata whenever there is a state
change. However, the size of the export file can be very large because
the metadata also includes the DataContext config and operator args,
which does not change over time, and they will be written multiple times
to the file. To reduce the file size and remove redundant info, we can
only export DataContext and operator args when the dataset is
[registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621),
and avoid them in the later state updates.

## Related issues
Related previous PRs:
[55355](ray-project#55355),
[53554](ray-project#53554)

Signed-off-by: cong.qian <cong.qian@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants