[Data] Update Export API metadata and refresh the dataset/operator state when there is a change#54623
Conversation
|
Warning Gemini encountered an error creating the summary. You can try again by commenting |
763855a to
ea748e4
Compare
|
what's remaining for this PR to be ready for review? |
c334865 to
150309c
Compare
|
Example export event: |
150309c to
b2fc6eb
Compare
can-anyscale
left a comment
There was a problem hiding this comment.
perhaps @omatthew98 can help review the data part; stamp from core as code owner
python/ray/data/_internal/stats.py
Outdated
| operator_metadata.end_time = update_time | ||
| # Handle outlier case for InputDataBuffer, which is marked as finished immediately and does not have a RUNNING state. | ||
| # Set the execution time the same as its end time | ||
| if operator.name == "Input": |
There was a problem hiding this comment.
This feels fragile. Not sure if there is a better way to support this though. If we do not do this, is the consequence that we might have some input data buffer operators that have no set execution_start_time?
There was a problem hiding this comment.
Yes, because execution_start_time is only set when the stats changes to RUNNING, which is not applicable for InputDataBuffer
There was a problem hiding this comment.
Can you pull the "Input" string out as a constant from here
, and then use that constant rather than a raw string?There was a problem hiding this comment.
Updated the logic to be more general to check the execution start time instead of the name. After this change, if any operator is marked as finished immediately and does not have a RUNNING state, we will set its start time the same as end time.
| update_operator_states(topology) | ||
| self._refresh_progress_bars(topology) | ||
|
|
||
| StatsManager.update_export_dataset_state( |
There was a problem hiding this comment.
I wonder if we could do some of the only updating the state if it changes logic here to prevent unnecessary calls to the stats actor? Basically prevent the more expensive path from being hit and reserve that for actual updates that we know are necessary?
There was a problem hiding this comment.
For PENDING/FINISHED/FAILED states, they are already reported just once. Added dataset and operator flags to prevent the RUNNING state being unnecessarily called.
a2ed80d to
1d36c06
Compare
|
@raulchen when you get a chance, could you also help take a review? Thanks a lot |
d303b0f to
ec7a6b2
Compare
when they get updated Signed-off-by: cong.qian <cong.qian@anyscale.com>
ec7a6b2 to
acb04cc
Compare
…ate when there is a change (#54623) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Some frequently used metadata fields are missing in the export API schema: - For both dataset and operator: state, execution start and end time These fields are important for us to observe the lifecycle of the datasets and operators, and can be used to improve the accuracy of reported metrics, such as throughput, which relies on the duration. <!-- Please give a short summary of the change and the problem this solves. --> Summary of change: - Add state, execution start and end time at the export API schema - Add a new state enum `PENDING` for dataset and operator, to represent the state when they are not running yet. - Refresh the metadata when ever the state of dataset/operator gets updated. And the event will always contains the latest snapshot of all the metadata. ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: cong.qian <cong.qian@anyscale.com> Signed-off-by: sampan <sampan@anyscale.com>
…ate when there is a change (ray-project#54623) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Some frequently used metadata fields are missing in the export API schema: - For both dataset and operator: state, execution start and end time These fields are important for us to observe the lifecycle of the datasets and operators, and can be used to improve the accuracy of reported metrics, such as throughput, which relies on the duration. <!-- Please give a short summary of the change and the problem this solves. --> Summary of change: - Add state, execution start and end time at the export API schema - Add a new state enum `PENDING` for dataset and operator, to represent the state when they are not running yet. - Refresh the metadata when ever the state of dataset/operator gets updated. And the event will always contains the latest snapshot of all the metadata. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: cong.qian <cong.qian@anyscale.com> Signed-off-by: Andrew Grosser <dioptre@gmail.com>
…rator state when there is a change (ray-project#54623)" (ray-project#55333) reverts commit 7cb74e8, that broke master branch due to conflict with ray-project#55163 Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Andrew Grosser <dioptre@gmail.com>
…ate when there is a change (ray-project#54623) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Some frequently used metadata fields are missing in the export API schema: - For both dataset and operator: state, execution start and end time These fields are important for us to observe the lifecycle of the datasets and operators, and can be used to improve the accuracy of reported metrics, such as throughput, which relies on the duration. <!-- Please give a short summary of the change and the problem this solves. --> Summary of change: - Add state, execution start and end time at the export API schema - Add a new state enum `PENDING` for dataset and operator, to represent the state when they are not running yet. - Refresh the metadata when ever the state of dataset/operator gets updated. And the event will always contains the latest snapshot of all the metadata. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: cong.qian <cong.qian@anyscale.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
…rator state when there is a change (ray-project#54623)" (ray-project#55333) reverts commit 7cb74e8, that broke master branch due to conflict with ray-project#55163 Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
…ate when there is a change (#54623) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Some frequently used metadata fields are missing in the export API schema: - For both dataset and operator: state, execution start and end time These fields are important for us to observe the lifecycle of the datasets and operators, and can be used to improve the accuracy of reported metrics, such as throughput, which relies on the duration. <!-- Please give a short summary of the change and the problem this solves. --> Summary of change: - Add state, execution start and end time at the export API schema - Add a new state enum `PENDING` for dataset and operator, to represent the state when they are not running yet. - Refresh the metadata when ever the state of dataset/operator gets updated. And the event will always contains the latest snapshot of all the metadata. ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: cong.qian <cong.qian@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Why are these changes needed?
Some frequently used metadata fields are missing in the export API schema:
These fields are important for us to observe the lifecycle of the datasets and operators, and can be used to improve the accuracy of reported metrics, such as throughput, which relies on the duration.
Summary of change:
PENDINGfor dataset and operator, to represent the state when they are not running yet.Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.