
[observability][export-api] Write TrainRun events #47888

Closed
nikitavemuri wants to merge 11 commits into ray-project:master from nikitavemuri:nikita/train_run_event_prototype

Conversation

Contributor

@nikitavemuri nikitavemuri commented Oct 3, 2024

Why are these changes needed?

  • Write TrainRun export events to file from TrainStateActor.register_train_run, which is called from the TrainRunStateManager when a train run is started and completed.
  • This is currently gated behind the RAY_enable_export_api_write environment variable.
  • Example events from a successful train run (the first has run_status="RUNNING" and the second has run_status="FINISHED"):
{"event_id": "2fAe049eABc86c7fAa", "timestamp": 1731453435, "source_type": "EXPORT_TRAIN_RUN", "event_data": {"name": "TorchTrainer_2024-11-12_15-17-07", "run_id": "cfc990d6ae774b358034205f7bc37ada", "job_id": "01000000", "controller_actor_id": "74fbd23b9399293ba8fdcda101000000", "workers": [{"actor_id": "d06165bbfb04a751c4426ba901000000", "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90696, "world_rank": 0, "local_rank": 0, "node_rank": 0, "gpu_ids": []}, {"actor_id": "8d96359de381f6639fa51de501000000", "world_rank": 1, "local_rank": 1, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90697, "node_rank": 0, "gpu_ids": []}, {"actor_id": "84ec7dcde4b3d5bd675b160101000000", "world_rank": 2, "local_rank": 2, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90698, "node_rank": 0, "gpu_ids": []}, {"actor_id": "bf164b82c1c49bdecd57adcb01000000", "world_rank": 3, "local_rank": 3, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90699, "node_rank": 0, "gpu_ids": []}], "start_time_ms": "1731453431802", "datasets": [], "run_status": "RUNNING", "status_detail": "", "end_time_ms": "0"}}
{"event_id": "213Ec30A1261A761b3", "timestamp": 1731453709, "source_type": "EXPORT_TRAIN_RUN", "event_data": {"name": "TorchTrainer_2024-11-12_15-17-07", "run_id": "cfc990d6ae774b358034205f7bc37ada", "job_id": "01000000", "controller_actor_id": "74fbd23b9399293ba8fdcda101000000", "workers": [{"actor_id": "d06165bbfb04a751c4426ba901000000", "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90696, "world_rank": 0, "local_rank": 0, "node_rank": 0, "gpu_ids": []}, {"actor_id": "8d96359de381f6639fa51de501000000", "world_rank": 1, "local_rank": 1, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90697, "node_rank": 0, "gpu_ids": []}, {"actor_id": "84ec7dcde4b3d5bd675b160101000000", "world_rank": 2, "local_rank": 2, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90698, "node_rank": 0, "gpu_ids": []}, {"actor_id": "bf164b82c1c49bdecd57adcb01000000", "world_rank": 3, "local_rank": 3, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90699, "node_rank": 0, "gpu_ids": []}], "run_status": "FINISHED", "start_time_ms": "1731453431802", "end_time_ms": "1731453709657", "datasets": [], "status_detail": ""}}
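These export events are written as newline-delimited JSON, one event per line. A minimal sketch of consuming such a file (the function name and filtering logic here are illustrative, not code from this PR):

```python
import json


def parse_train_run_events(path):
    """Parse newline-delimited TrainRun export events from an export log file."""
    events = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # Only keep TrainRun export events; other source types may share the file.
            if event.get("source_type") == "EXPORT_TRAIN_RUN":
                events.append(event["event_data"])
    return events
```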

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Nikita Vemuri added 5 commits October 2, 2024 23:32
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
@nikitavemuri nikitavemuri added the go add ONLY when ready to merge, run all tests label Oct 3, 2024
Nikita Vemuri added 4 commits November 12, 2024 15:22
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
@nikitavemuri nikitavemuri changed the title from [Prototype] Train run export events to [observability][export-api] Write TrainRun events Nov 13, 2024
@nikitavemuri nikitavemuri marked this pull request as ready for review November 13, 2024 00:24
Comment on lines +65 to +76
ExportTrainRunInfo.TrainWorkerInfo(
actor_id=worker.actor_id,
world_rank=worker.world_rank,
local_rank=worker.local_rank,
node_rank=worker.node_rank,
node_id=worker.node_id,
node_ip=worker.node_ip,
pid=worker.pid,
gpu_ids=worker.gpu_ids,
status=worker.status,
)
for worker in run_info.workers
Contributor


Is TrainWorkerInfo all available at once when this function gets called? Is it because the train run event doesn't get created until all the workers get created?

Contributor Author


Yes, looks like this is called after all the workers and datasets are initialized but before the training actually starts:

self.state_manager.register_train_run(

Contributor


Worth verifying that it's not possible for additional workers to be added later.

Contributor Author


@matthewdeng Could you confirm that additional workers are not added after a train run starts? The current behavior should match the information in the Train head API, because the export event is written whenever that data is updated.

Contributor


In the Train v2 elastic training setup, workers may join or leave the worker group dynamically. Specifically, there can be multiple run_ids associated with a single training name. Whenever new workers join, existing workers leave, or errors occur within the worker group, the system takes the following actions:

  1. Shut down the current worker group and terminate the corresponding train worker actors.
  2. Start a new worker group using the available resources, which may result in a different number of workers compared to the previous setup.

In this setup, the run_id corresponds to the lifetime of a specific worker group, while the name corresponds to the lifetime of the driver/controller. Once a run_id is created, no additional workers are added to its worker group. However, in Train v2, a single name can be associated with multiple run_ids, each potentially involving a different number of workers.

This implementation should still function correctly in Train v2, but when presenting logs to the frontend, it may be necessary to group related logs by name for clarity.
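Given that model, a downstream consumer could group event payloads first by name (driver/controller lifetime) and then by run_id (worker-group lifetime). A hypothetical sketch, not code from this PR:

```python
from collections import defaultdict


def group_events_by_run(event_datas):
    """Group TrainRun event payloads by run name, then by run_id.

    In Train v2, one `name` can be associated with multiple `run_id`s (one per
    worker group), so a two-level grouping keeps related logs together.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for data in event_datas:
        grouped[data["name"]][data["run_id"]].append(data)
    return grouped
```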

ALIVE = 1;
}

message TrainWorkerInfo {
Contributor


What's the benefit of using a nested message instead of a top-level message?

Contributor Author


I mainly added it as a nested message for consistency with the other export API schemas. None of these nested schemas need to be used by the other event types, so this helps with encapsulation. We can also define train-run-specific subsets of other resources' schemas (e.g., the only valid ActorStatus values for a train run are ALIVE and DEAD).


message ExportTrainRunInfo {
// State of a train run.
enum RunStatus {
Contributor


Do we need an UNKNOWN enum value?

I think it's a common protobuf pattern: https://stackoverflow.com/a/17164524

Contributor Author


Yeah, this is a common best practice. The other export API schemas don't follow this, but I'll update those in a follow-up.
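The reason this matters: proto3 enum fields default to 0, so reserving 0 for an explicit UNKNOWN prevents an unset field from silently reading as a meaningful state. A rough Python analogy (the member names below are illustrative, not the actual schema):

```python
from enum import IntEnum


class RunStatus(IntEnum):
    # Reserve 0 for "not set", mirroring the proto3 convention that an
    # unset enum field deserializes to the zero value.
    UNKNOWN = 0
    RUNNING = 1
    FINISHED = 2
    ERRORED = 3


def status_from_wire(value: int) -> RunStatus:
    """Map a raw wire value to RunStatus, falling back to UNKNOWN."""
    try:
        return RunStatus(value)
    except ValueError:
        # Unrecognized values (e.g., from a newer writer) degrade safely.
        return RunStatus.UNKNOWN
```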

Nikita Vemuri added 2 commits November 13, 2024 12:52
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
@justinvyu justinvyu self-assigned this Nov 14, 2024
Contributor

@hongpeng-guo hongpeng-guo left a comment


cc @nikitavemuri Thanks for persisting the train dashboard. This is a very clean PR. Left one comment above. We may have some offline discussions, as well 😄


@stale

stale bot commented Feb 1, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Feb 1, 2025
matthewdeng added a commit that referenced this pull request Mar 4, 2025
This PR adds Export API support for Ray Train state events.

## Key Changes

- Added new proto messages `ExportTrainRunEventData` and
`ExportTrainRunAttemptEventData` to capture training state
- Created `EventLogType` enum to manage different types of export event
logs
- Updated `TrainStateActor` to export Train state events when export is
enabled
- Modified timestamp fields from milliseconds to nanoseconds (for both
proto and python schema)
  - `start_time_ms` → `start_time_ns`
  - `end_time_ms` → `end_time_ns`

## Implementation Details

- Train run and attempt events are now written to the
`event_EXPORT_TRAIN_STATE.log` log file when the export API is enabled
- Export can be enabled either globally or specifically for Train events
using environment variables:
  - `RAY_enable_export_api_write=1` (all events)
- `RAY_enable_export_api_write_config=EXPORT_TRAIN_RUN` (Train run
events only)
- `RAY_enable_export_api_write_config=EXPORT_TRAIN_RUN_ATTEMPT` (Train
run attempt events only)

Based off of #47888.
Follows the new schema added in
#50515.

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
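As a rough illustration of the gating described in the commit message above, a simplified check might look like the following. This is a sketch that assumes a comma-separated config value, not the actual Ray implementation:

```python
import os


def is_export_enabled(source_type: str) -> bool:
    """Return True if export events of `source_type` should be written.

    Export is enabled either globally (RAY_enable_export_api_write=1) or for
    specific sources listed in RAY_enable_export_api_write_config, per the
    flags described above. The comma-separated parsing here is an assumption.
    """
    if os.environ.get("RAY_enable_export_api_write") == "1":
        return True
    config = os.environ.get("RAY_enable_export_api_write_config", "")
    enabled = {s.strip() for s in config.split(",") if s.strip()}
    return source_type in enabled
```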
abrarsheikh pushed a commit that referenced this pull request Mar 8, 2025
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
@stale

stale bot commented Apr 25, 2025

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Apr 25, 2025