[observability][export-api] Write TrainRun events #47888
nikitavemuri wants to merge 11 commits into ray-project:master
Conversation
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
```python
ExportTrainRunInfo.TrainWorkerInfo(
    actor_id=worker.actor_id,
    world_rank=worker.world_rank,
    local_rank=worker.local_rank,
    node_rank=worker.node_rank,
    node_id=worker.node_id,
    node_ip=worker.node_ip,
    pid=worker.pid,
    gpu_ids=worker.gpu_ids,
    status=worker.status,
)
for worker in run_info.workers
```
Is TrainWorkerInfo all available at once when this function gets called? Is it because the train run event doesn't get created until all the workers get created?
Yes, looks like this is called after all the workers and datasets are initialized but before the training actually starts:
Worth verifying that it's not possible for additional workers to be added later.
@matthewdeng Could you confirm that additional workers are not added after a train run starts? The current behavior should capture the same information as the train head API, because the export event is written whenever that data is updated.
There was a problem hiding this comment.
In the Train v2 elastic training setup, workers may join or leave the worker group dynamically. Specifically, there can be multiple run_ids associated with a single training name. Whenever new workers join, existing workers leave, or errors occur within the worker group, the system takes the following actions:
- Shut down the current worker group and terminate the corresponding train worker actors.
- Start a new worker group using the available resources, which may result in a different number of workers compared to the previous setup.
In this setup, the run_id corresponds to the lifetime of a specific worker group, while the name corresponds to the lifetime of the driver/controller. Once a run_id is created, no additional workers are added to its worker group. However, in Train v2, a single name can be associated with multiple run_ids, each potentially involving a different number of workers.
This implementation should still function correctly in Train v2, but when presenting logs to the frontend, it may be necessary to group related logs by name for clarity.
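To illustrate the grouping suggested above, the sketch below buckets exported run records by `name` so that multiple `run_id`s produced by elastic restarts display together. The dicts and field values here are hypothetical stand-ins, not the actual export event schema:

```python
from collections import defaultdict

# Hypothetical exported run records; the fields mirror the discussion above
# (name = driver/controller lifetime, run_id = worker-group lifetime)
# but are not the real export event schema.
events = [
    {"name": "resnet_job", "run_id": "a1", "num_workers": 4},
    {"name": "resnet_job", "run_id": "b2", "num_workers": 2},  # after elastic restart
    {"name": "bert_job", "run_id": "c3", "num_workers": 8},
]


def group_runs_by_name(events):
    """Group run records by training name so related run_ids show together."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["name"]].append(event["run_id"])
    return dict(grouped)


print(group_runs_by_name(events))
# {'resnet_job': ['a1', 'b2'], 'bert_job': ['c3']}
```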
```proto
  ALIVE = 1;
}

message TrainWorkerInfo {
```
what's the benefit of doing a sub message instead of top-level message?
I mainly added as a nested message for consistency with the other export API schemas. None of these nested schemas need to be used by the other event types, so this can help with encapsulation. We can also create subsets of schemas of other resources specific to train runs (eg: the only valid ActorStatus for a train run is ALIVE and DEAD)
```proto
message ExportTrainRunInfo {
  // State of a train run.
  enum RunStatus {
```
do we need an UNKNOWN enum?
I think it's a pattern of protobufs: https://stackoverflow.com/a/17164524
Yeah, this is a common best practice. The other export API schemas don't follow this, but I'll update those in a followup
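Putting both review points together, a sketch of a nested schema with an explicit zero-value enum might look like the following. The field numbers, statuses, and field names here are illustrative only, not the exact schema from this PR:

```proto
message ExportTrainRunInfo {
  // Zero value first, per proto3 convention, so an unset field is
  // distinguishable from a real status.
  enum RunStatus {
    RUN_STATUS_UNSPECIFIED = 0;
    RUNNING = 1;
    FINISHED = 2;
  }

  // Nested so the schema is encapsulated within train run events and can
  // remain a train-specific subset of other resources' schemas.
  message TrainWorkerInfo {
    string actor_id = 1;
    int32 world_rank = 2;
  }

  string name = 1;
  RunStatus run_status = 2;
  repeated TrainWorkerInfo workers = 3;
}
```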
hongpeng-guo left a comment
cc @nikitavemuri Thanks for persisting the train dashboard. This is a very clean PR. Left one comment above. We may have some offline discussions, as well 😄
This PR adds Export API support for Ray Train state events.

## Key Changes

- Added new proto messages `ExportTrainRunEventData` and `ExportTrainRunAttemptEventData` to capture training state
- Created `EventLogType` enum to manage different types of export event logs
- Updated `TrainStateActor` to export Train state events when export is enabled
- Modified timestamp fields from milliseconds to nanoseconds (for both proto and python schema)
  - `start_time_ms` → `start_time_ns`
  - `end_time_ms` → `end_time_ns`

## Implementation Details

- Train run and attempt events are now written to the `event_EXPORT_TRAIN_STATE.log` log file when the export API is enabled
- Export can be enabled either globally or specifically for Train events using environment variables:
  - `RAY_enable_export_api_write=1` (all events)
  - `RAY_enable_export_api_write_config=EXPORT_TRAIN_RUN` (Train run events only)
  - `RAY_enable_export_api_write_config=EXPORT_TRAIN_RUN_ATTEMPT` (Train run attempt events only)

Based off of #47888. Follows the new schema added in #50515.

---

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
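The environment-variable gating described above can be sketched as follows. The variable names and event source names come from the PR description, but the parsing logic itself is a simplified stand-in, not Ray's actual configuration handling:

```python
import os

# Event source names from the PR description; the parsing below is a
# simplified stand-in for Ray's real config handling.
TRAIN_EVENT_SOURCES = {"EXPORT_TRAIN_RUN", "EXPORT_TRAIN_RUN_ATTEMPT"}


def is_train_export_enabled(environ=os.environ):
    """Return True if Train export events should be written."""
    if environ.get("RAY_enable_export_api_write") == "1":
        return True  # globally enabled for all event types
    config = environ.get("RAY_enable_export_api_write_config", "")
    enabled = {source.strip() for source in config.split(",") if source.strip()}
    return bool(enabled & TRAIN_EVENT_SOURCES)


print(is_train_export_enabled({"RAY_enable_export_api_write": "1"}))  # True
print(is_train_export_enabled({"RAY_enable_export_api_write_config": "EXPORT_TRAIN_RUN"}))  # True
print(is_train_export_enabled({}))  # False
```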
Why are these changes needed?

- Train run export events are written in `TrainStateActor.register_train_run`, which is called from the `TrainRunStateManager` when a train run is started and completed
- Writing export events is gated by the `RAY_enable_export_api_write` environment variable flag
- Two events are written per train run (the first has `run_status="RUNNING"` and the second has `run_status="FINISHED"`)
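Assuming each line of the export event log is a JSON object carrying the event payload (a plausible but unverified assumption here; the `event_data` key and field names below are hypothetical), checking the RUNNING → FINISHED sequence could look like:

```python
import json

# Two synthetic log lines standing in for the real export event format,
# which may differ; only run_status is inspected here.
log_lines = [
    '{"event_data": {"name": "my_run", "run_status": "RUNNING"}}',
    '{"event_data": {"name": "my_run", "run_status": "FINISHED"}}',
]


def run_statuses(lines):
    """Extract run_status from each JSON-encoded export event line."""
    return [json.loads(line)["event_data"]["run_status"] for line in lines]


print(run_statuses(log_lines))  # ['RUNNING', 'FINISHED']
```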
Related issue number

Checks

- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.