Skip to content

[Data] Update Export API metadata and refresh the dataset/operator state when there is a change#54623

Merged
alexeykudinkin merged 1 commit intoray-project:masterfrom
coqian:coqian/data-export-refresh
Aug 6, 2025
Merged

[Data] Update Export API metadata and refresh the dataset/operator state when there is a change#54623
alexeykudinkin merged 1 commit intoray-project:masterfrom
coqian:coqian/data-export-refresh

Conversation

@coqian
Copy link
Copy Markdown
Contributor

@coqian coqian commented Jul 15, 2025

Why are these changes needed?

Some frequently used metadata fields are missing in the export API schema:

  • For both dataset and operator: state, execution start and end time

These fields are important for us to observe the lifecycle of the datasets and operators, and can be used to improve the accuracy of reported metrics, such as throughput, which relies on the duration.

Summary of change:

  • Add state, execution start and end time at the export API schema
  • Add a new state enum PENDING for dataset and operator, to represent the state when they are not running yet.
  • Refresh the metadata when ever the state of dataset/operator gets updated. And the event will always contains the latest snapshot of all the metadata.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the summary. You can try again by commenting /gemini summary.

@coqian coqian force-pushed the coqian/data-export-refresh branch from 763855a to ea748e4 Compare July 21, 2025 22:45
@alanwguo
Copy link
Copy Markdown
Contributor

what's remaining for this PR to be ready for review?

@coqian coqian force-pushed the coqian/data-export-refresh branch 4 times, most recently from c334865 to 150309c Compare July 25, 2025 22:31
@coqian coqian marked this pull request as ready for review July 25, 2025 22:33
@coqian coqian requested review from a team as code owners July 25, 2025 22:33
@coqian coqian changed the title [WIP][Data] Update Export API metadata and refresh the dataset/operator state when there is a change [Data] Update Export API metadata and refresh the dataset/operator state when there is a change Jul 25, 2025
@coqian
Copy link
Copy Markdown
Contributor Author

coqian commented Jul 25, 2025

Example export event:

{
  "event_id": "0E4F5fBba8ac7Ff162",
  "timestamp": 1753137602,
  "source_type": "EXPORT_DATASET_METADATA",
  "event_data": {
    "topology": {
      "operators": [
        {
          "name": "Input",
          "id": "Input_0",
          "uuid": "2e553a51-cd08-4ddd-9b5d-5da9a59062b7",
          "execution_time": 1753137602.424319,
          "end_time": 1753137602.424319,
          "state": "FINISHED",
          "input_dependencies": [],
          "sub_stages": []
        },
        {
          "name": "Input",
          "id": "Input_1",
          "uuid": "28394852-ecb7-478f-8484-aa4f9067a486",
          "execution_time": 1753137602.4274538,
          "end_time": 1753137602.4274538,
          "state": "FINISHED",
          "input_dependencies": [],
          "sub_stages": []
        },
        {
          "name": "Input",
          "id": "Input_2",
          "uuid": "678ce9dd-9e8d-49ba-b8ba-9c36430a80d2",
          "execution_time": 1753137602.4300847,
          "end_time": 1753137602.4300847,
          "state": "FINISHED",
          "input_dependencies": [],
          "sub_stages": []
        },
        {
          "name": "UnionOperator(Input, Input, Input)",
          "id": "UnionOperator(Input, Input, Input)_3",
          "uuid": "53cd5d39-5f33-4215-a474-f365a8633f32",
          "input_dependencies": [
            "Input_0",
            "Input_1",
            "Input_2"
          ],
          "execution_time": 1753137602.406018,
          "end_time": 1753137602.4426837,
          "state": "FINISHED",
          "sub_stages": []
        },
        {
          "name": "Input",
          "id": "Input_4",
          "uuid": "8ea1d9be-5f1c-4ef9-9e22-200cc0ec93c8",
          "execution_time": 1753137602.4329152,
          "end_time": 1753137602.4329152,
          "state": "FINISHED",
          "input_dependencies": [],
          "sub_stages": []
        },
        {
          "name": "UnionOperator(UnionOperator(Input, Input, Input), Input)",
          "id": "UnionOperator(UnionOperator(Input, Input, Input), Input)_5",
          "uuid": "3e457b74-a185-43fd-bc85-1c477c3d04fa",
          "input_dependencies": [
            "UnionOperator(Input, Input, Input)_3",
            "Input_4"
          ],
          "execution_time": 1753137602.4110644,
          "end_time": 1753137602.4490502,
          "state": "FINISHED",
          "sub_stages": []
        }
      ]
    },
    "dataset_id": "dataset_4_0",
    "job_id": "01000000",
    "start_time": 1753137602.390922,
    "execution_time": 1753137602.4007301,
    "end_time": 1753137602.4568942,
    "state": "FINISHED"
  }
}

@coqian coqian force-pushed the coqian/data-export-refresh branch from 150309c to b2fc6eb Compare July 28, 2025 07:24
Copy link
Copy Markdown
Contributor

@can-anyscale can-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps @omatthew98 can help review the data part; stamp from core as code owner

operator_metadata.end_time = update_time
# Handle outlier case for InputDataBuffer, which is marked as finished immediately and does not have a RUNNING state.
# Set the execution time the same as its end time
if operator.name == "Input":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels fragile. Not sure if there is a better way to support this though. If we do not do this, is the consequence that we might have some input data buffer operators that have no set execution_start_time?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because execution_start_time is only set when the stats changes to RUNNING, which is not applicable for InputDataBuffer

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you pull the "Input" string out as a constant from here

super().__init__("Input", [], data_context, target_max_block_size=None)
, and then use that constant rather than a raw string?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the logic to be more general to check the execution start time instead of the name. After this change, if any operator is marked as finished immediately and does not have a RUNNING state, we will set its start time the same as end time.

update_operator_states(topology)
self._refresh_progress_bars(topology)

StatsManager.update_export_dataset_state(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could do some of the only updating the state if it changes logic here to prevent unnecessary calls to the stats actor? Basically prevent the more expensive path from being hit and reserve that for actual updates that we know are necessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PENDING/FINISHED/FAILED states, they are already reported just once. Added dataset and operator flags to prevent the RUNNING state being unnecessarily called.

@coqian coqian force-pushed the coqian/data-export-refresh branch 2 times, most recently from a2ed80d to 1d36c06 Compare July 30, 2025 19:58
@coqian
Copy link
Copy Markdown
Contributor Author

coqian commented Jul 30, 2025

@raulchen when you get a chance, could you also help take a review? Thanks a lot

@coqian coqian force-pushed the coqian/data-export-refresh branch 4 times, most recently from d303b0f to ec7a6b2 Compare August 5, 2025 00:17
when they get updated

Signed-off-by: cong.qian <cong.qian@anyscale.com>
@coqian coqian force-pushed the coqian/data-export-refresh branch from ec7a6b2 to acb04cc Compare August 5, 2025 04:25
@omatthew98 omatthew98 added the go add ONLY when ready to merge, run all tests label Aug 6, 2025
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) August 6, 2025 21:14
@alexeykudinkin alexeykudinkin merged commit 7cb74e8 into ray-project:master Aug 6, 2025
7 checks passed
alexeykudinkin added a commit that referenced this pull request Aug 7, 2025
…rator state when there is a change (#54623)"

This reverts commit 7cb74e8.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
aslonnie pushed a commit that referenced this pull request Aug 7, 2025
…rator state when there is a change (#54623)" (#55333)

reverts commit 7cb74e8, that broke
master branch due to conflict with
#55163


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
sampan-s-nayak pushed a commit that referenced this pull request Aug 12, 2025
…ate when there is a change (#54623)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
Some frequently used metadata fields are missing in the export API
schema:
- For both dataset and operator: state, execution start and end time

These fields are important for us to observe the lifecycle of the
datasets and operators, and can be used to improve the accuracy of
reported metrics, such as throughput, which relies on the duration.

<!-- Please give a short summary of the change and the problem this
solves. -->
Summary of change:
- Add state, execution start and end time at the export API schema
- Add a new state enum `PENDING` for dataset and operator, to represent
the state when they are not running yet.
- Refresh the metadata when ever the state of dataset/operator gets
updated. And the event will always contains the latest snapshot of all
the metadata.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: cong.qian <cong.qian@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
sampan-s-nayak pushed a commit that referenced this pull request Aug 12, 2025
…rator state when there is a change (#54623)" (#55333)

reverts commit 7cb74e8, that broke
master branch due to conflict with
#55163


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: sampan <sampan@anyscale.com>
dioptre pushed a commit to sourcetable/ray that referenced this pull request Aug 20, 2025
…ate when there is a change (ray-project#54623)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
Some frequently used metadata fields are missing in the export API
schema:
- For both dataset and operator: state, execution start and end time

These fields are important for us to observe the lifecycle of the
datasets and operators, and can be used to improve the accuracy of
reported metrics, such as throughput, which relies on the duration.

<!-- Please give a short summary of the change and the problem this
solves. -->
Summary of change:
- Add state, execution start and end time at the export API schema
- Add a new state enum `PENDING` for dataset and operator, to represent
the state when they are not running yet.
- Refresh the metadata when ever the state of dataset/operator gets
updated. And the event will always contains the latest snapshot of all
the metadata.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: cong.qian <cong.qian@anyscale.com>
Signed-off-by: Andrew Grosser <dioptre@gmail.com>
dioptre pushed a commit to sourcetable/ray that referenced this pull request Aug 20, 2025
…rator state when there is a change (ray-project#54623)" (ray-project#55333)

reverts commit 7cb74e8, that broke
master branch due to conflict with
ray-project#55163

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Andrew Grosser <dioptre@gmail.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…ate when there is a change (ray-project#54623)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
Some frequently used metadata fields are missing in the export API
schema:
- For both dataset and operator: state, execution start and end time

These fields are important for us to observe the lifecycle of the
datasets and operators, and can be used to improve the accuracy of
reported metrics, such as throughput, which relies on the duration.

<!-- Please give a short summary of the change and the problem this
solves. -->
Summary of change:
- Add state, execution start and end time at the export API schema
- Add a new state enum `PENDING` for dataset and operator, to represent
the state when they are not running yet.
- Refresh the metadata when ever the state of dataset/operator gets
updated. And the event will always contains the latest snapshot of all
the metadata.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: cong.qian <cong.qian@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…rator state when there is a change (ray-project#54623)" (ray-project#55333)

reverts commit 7cb74e8, that broke
master branch due to conflict with
ray-project#55163

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…ate when there is a change (#54623)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
Some frequently used metadata fields are missing in the export API
schema:
- For both dataset and operator: state, execution start and end time

These fields are important for us to observe the lifecycle of the
datasets and operators, and can be used to improve the accuracy of
reported metrics, such as throughput, which relies on the duration.

<!-- Please give a short summary of the change and the problem this
solves. -->
Summary of change:
- Add state, execution start and end time at the export API schema
- Add a new state enum `PENDING` for dataset and operator, to represent
the state when they are not running yet.
- Refresh the metadata when ever the state of dataset/operator gets
updated. And the event will always contains the latest snapshot of all
the metadata.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: cong.qian <cong.qian@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…rator state when there is a change (#54623)" (#55333)

reverts commit 7cb74e8, that broke
master branch due to conflict with
#55163

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants