[Data] Update the export API to remove the redundant data context and op args when refreshing metadata#58755
Code Review
This pull request introduces an effective optimization to reduce the size of exported dataset metadata. By adding flags to control the export of DataContext and operator arguments, you've successfully removed redundant information from subsequent state updates, while ensuring the full metadata is exported upon initial registration. The changes are well-implemented across metadata_exporter.py and stats.py. I have one suggestion to improve the API consistency in the abstract base class.
Example updated export event after the change, where the data context and operator args are only exported once.
b31cd87 to 22703d5
python/ray/data/_internal/stats.py
Outdated
    export_data_context=False,
    export_op_args=False,
I noticed that here and further down you set the defaults to False. When are they set to True? For example, are we exporting just the top-level op_args and dataset context?
It will be True when the dataset is registered, which is the first time the dataset is exported.
    export_data_context: bool = True,
    export_op_args: bool = True,
Thoughts on renaming these to include_data_context and include_op_args? That feels more descriptive to me.
Updated with the suggested names
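For readers following the thread, here is a minimal sketch of how the renamed flags could behave. It is not the code in stats.py or metadata_exporter.py: the function `build_export_payload`, the dictionary layout, and the sample operator and context values are all hypothetical and only illustrate the "full export on registration, trimmed export on later updates" idea discussed above.

```python
from typing import Any, Dict, List


def build_export_payload(
    dataset_id: str,
    state: str,
    operators: List[Dict[str, Any]],
    data_context: Dict[str, Any],
    include_data_context: bool = True,
    include_op_args: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble one export event for a dataset."""
    payload: Dict[str, Any] = {
        "dataset_id": dataset_id,
        "state": state,
        "operators": [],
    }
    # The data context does not change over time, so it is optional.
    if include_data_context:
        payload["data_context"] = dict(data_context)
    for op in operators:
        entry = {"name": op["name"], "state": op["state"]}
        # Operator args are also static, so they are optional as well.
        if include_op_args:
            entry["args"] = dict(op["args"])
        payload["operators"].append(entry)
    return payload


# Illustrative inputs (made up for this sketch).
ops = [{"name": "ReadParquet", "state": "RUNNING", "args": {"num_cpus": 1}}]
ctx = {"target_max_block_size": 128 * 1024 * 1024}

# Registration: defaults are True, so everything is exported.
registered = build_export_payload("ds_1", "RUNNING", ops, ctx)

# Later state update: the caller opts out of the static fields.
refreshed = build_export_payload(
    "ds_1",
    "FINISHED",
    ops,
    ctx,
    include_data_context=False,
    include_op_args=False,
)
```

With defaults of True, a caller that forgets to pass the flags still produces a complete record; only the refresh path has to opt out explicitly.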
… op args when refreshing metadata Signed-off-by: cong.qian <cong.qian@anyscale.com>
22703d5 to 126107c
… op args when refreshing metadata (ray-project#58755) ## Description We export dataset and operator metadata whenever there is a state change. However, the size of the export file can be very large because the metadata also includes the DataContext config and operator args, which does not change over time, and they will be written multiple times to the file. To reduce the file size and remove redundant info, we can only export DataContext and operator args when the dataset is [registered](https://github.com/ray-project/ray/blob/d1cce8c9dc8411fad7cfbd619350bec6f19839a3/python/ray/data/_internal/stats.py#L621), and avoid them in the later state updates. ## Related issues Related previous PRs: [55355](ray-project#55355), [53554](ray-project#53554) Signed-off-by: cong.qian <cong.qian@anyscale.com>
Description
We export dataset and operator metadata whenever there is a state change. However, the export file can become very large because the metadata also includes the DataContext config and operator args, which do not change over time but are written to the file on every export. To reduce the file size and remove redundant info, we export DataContext and operator args only when the dataset is registered, and omit them from later state updates.
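As a rough illustration of the size effect described above, the following sketch compares writing the static fields on every event with writing them only on the first one. The event layout, the `export_event` helper, the state names, and the sample values are assumptions made for this example; they are not taken from the actual export format.

```python
import json


def export_event(dataset_id, state, data_context, op_args, full):
    """Hypothetical serializer: one JSON line per state change."""
    event = {"dataset_id": dataset_id, "state": state}
    if full:
        # Static fields: only worth writing once per dataset.
        event["data_context"] = data_context
        event["op_args"] = op_args
    return json.dumps(event)


# Made-up static metadata and state sequence for the comparison.
data_context = {f"option_{i}": i for i in range(50)}
op_args = {"Map": {"num_cpus": 1, "concurrency": 4}}
states = ["PENDING", "RUNNING", "FINISHED"]

# Before: every state change repeats the static fields.
before = [
    export_event("ds_1", s, data_context, op_args, full=True) for s in states
]

# After: only the first event (registration) carries them.
after = [
    export_event("ds_1", s, data_context, op_args, full=(i == 0))
    for i, s in enumerate(states)
]

bytes_before = sum(len(line) for line in before)
bytes_after = sum(len(line) for line in after)
print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```

The saving grows with the number of state updates per dataset, since the static fields are paid for once instead of once per event.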
Related issues
Related previous PRs: 55355, 53554