Add DataContext + LogicalOp Args to Dataset Export#53554
Conversation
87694b0 to
6a6455f
Compare
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
6a6455f to
8fb69eb
Compare
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
eac92c1 to
384d5fb
Compare
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
68610f1 to
fb7131a
Compare
85999e9 to
0b4b974
Compare
…u/add-ctx-and-logical-ops-to-dataset-metadata-exporter
20d1bf5 to
c12a2c9
Compare
7db0d59 to
ec730b9
Compare
a9bbd8e to
df0af64
Compare
There was a problem hiding this comment.
this is where we catch -all
df0af64 to
778c8f4
Compare
778c8f4 to
89549c8
Compare
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
89549c8 to
5f6bbed
Compare
|
|
||
| def _get_args(self) -> Dict[str, Any]: | ||
| """This Dict must be serializable""" | ||
| return {k: sanitize_for_struct(v) for k, v in vars(self).items()} |
There was a problem hiding this comment.
Previously sanitize_for_struct was called on the metadata exporter.
And that looks cleaner than putting it here.
Was there any issues?
There was a problem hiding this comment.
yes, the issue was sometimes the args are not pickable, so i was encountering this error
!!! FAIL serialization: cannot pickle '_thread.lock' object
This is because we call register_dataset_for_stats_actor, which rpc serializes and fails. honestly i could case handle this, but didn't want to risk
| data_context: DataContext | ||
|
|
||
|
|
||
| def sanitize_for_struct(obj): |
There was a problem hiding this comment.
actually should we just json.dumps the entire obj?
this function seems to be just reinventing json.dumps.
There was a problem hiding this comment.
that sounds smart, im gonna try that Oh wait, the reason why is because if there is a deeply nested json, and one argument is not serialzable, then it fails the entire thing. The sanitize_for_struct gives a best guess estimate of the obj itself. lmk if that doesn't make sense
raulchen
left a comment
There was a problem hiding this comment.
LGTM. a few final comments.
80ffca0 to
2e29c7a
Compare
… op args when refreshing metadata (#58755) ## Description We export dataset and operator metadata whenever there is a state change. However, the size of the export file can be very large because the metadata also includes the DataContext config and operator args, which does not change over time, and they will be written multiple times to the file. To reduce the file size and remove redundant info, we can only export DataContext and operator args when the dataset is [registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621), and avoid them in the later state updates. ## Related issues Related previous PRs: [55355](#55355), [53554](#53554) Signed-off-by: cong.qian <cong.qian@anyscale.com>
… op args when refreshing metadata (ray-project#58755) ## Description We export dataset and operator metadata whenever there is a state change. However, the size of the export file can be very large because the metadata also includes the DataContext config and operator args, which does not change over time, and they will be written multiple times to the file. To reduce the file size and remove redundant info, we can only export DataContext and operator args when the dataset is [registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621), and avoid them in the later state updates. ## Related issues Related previous PRs: [55355](ray-project#55355), [53554](ray-project#53554) Signed-off-by: cong.qian <cong.qian@anyscale.com>
… op args when refreshing metadata (ray-project#58755) ## Description We export dataset and operator metadata whenever there is a state change. However, the size of the export file can be very large because the metadata also includes the DataContext config and operator args, which does not change over time, and they will be written multiple times to the file. To reduce the file size and remove redundant info, we can only export DataContext and operator args when the dataset is [registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621), and avoid them in the later state updates. ## Related issues Related previous PRs: [55355](ray-project#55355), [53554](ray-project#53554) Signed-off-by: cong.qian <cong.qian@anyscale.com> Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
… op args when refreshing metadata (ray-project#58755) ## Description We export dataset and operator metadata whenever there is a state change. However, the size of the export file can be very large because the metadata also includes the DataContext config and operator args, which does not change over time, and they will be written multiple times to the file. To reduce the file size and remove redundant info, we can only export DataContext and operator args when the dataset is [registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621), and avoid them in the later state updates. ## Related issues Related previous PRs: [55355](ray-project#55355), [53554](ray-project#53554) Signed-off-by: cong.qian <cong.qian@anyscale.com>
… op args when refreshing metadata (ray-project#58755) ## Description We export dataset and operator metadata whenever there is a state change. However, the size of the export file can be very large because the metadata also includes the DataContext config and operator args, which does not change over time, and they will be written multiple times to the file. To reduce the file size and remove redundant info, we can only export DataContext and operator args when the dataset is [registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621), and avoid them in the later state updates. ## Related issues Related previous PRs: [55355](ray-project#55355), [53554](ray-project#53554) Signed-off-by: cong.qian <cong.qian@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Why are these changes needed?
event_EXPORT_DATASET_METADATA.log
^ Here is an example file.
Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.