
[observability][export-api] Write TrainRun events #47888

Closed
nikitavemuri wants to merge 11 commits into ray-project:master from nikitavemuri:nikita/train_run_event_prototype

Conversation

Contributor

@nikitavemuri nikitavemuri commented Oct 3, 2024

Why are these changes needed?

  • Write TrainRun export events to file from TrainStateActor.register_train_run, which is called from the TrainRunStateManager when a train run is started and completed.
  • This is currently gated behind the RAY_enable_export_api_write environment variable.
  • Example events from a successful train run (the first has run_status="RUNNING" and the second has run_status="FINISHED"):
{"event_id": "2fAe049eABc86c7fAa", "timestamp": 1731453435, "source_type": "EXPORT_TRAIN_RUN", "event_data": {"name": "TorchTrainer_2024-11-12_15-17-07", "run_id": "cfc990d6ae774b358034205f7bc37ada", "job_id": "01000000", "controller_actor_id": "74fbd23b9399293ba8fdcda101000000", "workers": [{"actor_id": "d06165bbfb04a751c4426ba901000000", "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90696, "world_rank": 0, "local_rank": 0, "node_rank": 0, "gpu_ids": []}, {"actor_id": "8d96359de381f6639fa51de501000000", "world_rank": 1, "local_rank": 1, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90697, "node_rank": 0, "gpu_ids": []}, {"actor_id": "84ec7dcde4b3d5bd675b160101000000", "world_rank": 2, "local_rank": 2, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90698, "node_rank": 0, "gpu_ids": []}, {"actor_id": "bf164b82c1c49bdecd57adcb01000000", "world_rank": 3, "local_rank": 3, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90699, "node_rank": 0, "gpu_ids": []}], "start_time_ms": "1731453431802", "datasets": [], "run_status": "RUNNING", "status_detail": "", "end_time_ms": "0"}}
{"event_id": "213Ec30A1261A761b3", "timestamp": 1731453709, "source_type": "EXPORT_TRAIN_RUN", "event_data": {"name": "TorchTrainer_2024-11-12_15-17-07", "run_id": "cfc990d6ae774b358034205f7bc37ada", "job_id": "01000000", "controller_actor_id": "74fbd23b9399293ba8fdcda101000000", "workers": [{"actor_id": "d06165bbfb04a751c4426ba901000000", "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90696, "world_rank": 0, "local_rank": 0, "node_rank": 0, "gpu_ids": []}, {"actor_id": "8d96359de381f6639fa51de501000000", "world_rank": 1, "local_rank": 1, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90697, "node_rank": 0, "gpu_ids": []}, {"actor_id": "84ec7dcde4b3d5bd675b160101000000", "world_rank": 2, "local_rank": 2, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90698, "node_rank": 0, "gpu_ids": []}, {"actor_id": "bf164b82c1c49bdecd57adcb01000000", "world_rank": 3, "local_rank": 3, "node_id": "d4a946da15a60e37ad4ae8e729bdb6ea7f66dc0eed536e330aa730e2", "node_ip": "127.0.0.1", "pid": 90699, "node_rank": 0, "gpu_ids": []}], "run_status": "FINISHED", "start_time_ms": "1731453431802", "end_time_ms": "1731453709657", "datasets": [], "status_detail": ""}}
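These export events are written as newline-delimited JSON, one event per line. A minimal sketch of consuming such a file (the function name and filtering logic here are illustrative, not code from this PR):

```python
import json


def parse_train_run_events(path):
    """Parse newline-delimited TrainRun export events from an export log file."""
    events = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # Only keep TrainRun export events; other source types may share the file.
            if event.get("source_type") == "EXPORT_TRAIN_RUN":
                events.append(event["event_data"])
    return events
```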

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Nikita Vemuri added 5 commits October 2, 2024 23:32
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
@nikitavemuri nikitavemuri added the go add ONLY when ready to merge, run all tests label Oct 3, 2024
Nikita Vemuri added 4 commits November 12, 2024 15:22
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
@nikitavemuri nikitavemuri changed the title from [Prototype] Train run export events to [observability][export-api] Write TrainRun events Nov 13, 2024
@nikitavemuri nikitavemuri marked this pull request as ready for review November 13, 2024 00:24
Comment on lines +65 to +76
ExportTrainRunInfo.TrainWorkerInfo(
actor_id=worker.actor_id,
world_rank=worker.world_rank,
local_rank=worker.local_rank,
node_rank=worker.node_rank,
node_id=worker.node_id,
node_ip=worker.node_ip,
pid=worker.pid,
gpu_ids=worker.gpu_ids,
status=worker.status,
)
for worker in run_info.workers
Contributor


Is TrainWorkerInfo all available at once when this function gets called? Is it because the train run event doesn't get created until all the workers get created?

Contributor Author


Yes, looks like this is called after all the workers and datasets are initialized but before the training actually starts:

self.state_manager.register_train_run(

Contributor


Worth verifying that it's not possible for additional workers to be added later.

Contributor Author


@matthewdeng Could you confirm that additional workers are not added after a train run starts? The current behavior should match the information in the Train head API, because the export event is written whenever that data is updated.

Contributor


In the Train v2 elastic training setup, workers may join or leave the worker group dynamically. Specifically, there can be multiple run_ids associated with a single training name. Whenever new workers join, existing workers leave, or errors occur within the worker group, the system takes the following actions:

  1. Shut down the current worker group and terminate the corresponding train worker actors.
  2. Start a new worker group using the available resources, which may result in a different number of workers compared to the previous setup.

In this setup, the run_id corresponds to the lifetime of a specific worker group, while the name corresponds to the lifetime of the driver/controller. Once a run_id is created, no additional workers are added to its worker group. However, in Train v2, a single name can be associated with multiple run_ids, each potentially involving a different number of workers.

This implementation should still function correctly in Train v2, but when presenting logs to the frontend, it may be necessary to group related logs by name for clarity.
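Given that model, a downstream consumer could group event payloads first by name (driver/controller lifetime) and then by run_id (worker-group lifetime). A hypothetical sketch, not code from this PR:

```python
from collections import defaultdict


def group_events_by_run(event_datas):
    """Group TrainRun event payloads by run name, then by run_id.

    In Train v2, one `name` can be associated with multiple `run_id`s (one per
    worker group), so a two-level grouping keeps related logs together.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for data in event_datas:
        grouped[data["name"]][data["run_id"]].append(data)
    return grouped
```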

ALIVE = 1;
}

message TrainWorkerInfo {
Contributor


What's the benefit of using a nested message instead of a top-level message?

Contributor Author


I mainly added it as a nested message for consistency with the other export API schemas. None of these nested schemas need to be used by the other event types, so this helps with encapsulation. We can also define train-run-specific subsets of other resources' schemas (e.g., the only valid ActorStatus values for a train run are ALIVE and DEAD).


message ExportTrainRunInfo {
// State of a train run.
enum RunStatus {
Contributor


Do we need an UNKNOWN enum value?

I think it's a common protobuf pattern: https://stackoverflow.com/a/17164524

Contributor Author


Yeah, this is a common best practice. The other export API schemas don't follow this, but I'll update those in a follow-up.
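The reason this matters: proto3 enum fields default to 0, so reserving 0 for an explicit UNKNOWN prevents an unset field from silently reading as a meaningful state. A rough Python analogy (the member names below are illustrative, not the actual schema):

```python
from enum import IntEnum


class RunStatus(IntEnum):
    # Reserve 0 for "not set", mirroring the proto3 convention that an
    # unset enum field deserializes to the zero value.
    UNKNOWN = 0
    RUNNING = 1
    FINISHED = 2
    ERRORED = 3


def status_from_wire(value: int) -> RunStatus:
    """Map a raw wire value to RunStatus, falling back to UNKNOWN."""
    try:
        return RunStatus(value)
    except ValueError:
        # Unrecognized values (e.g., from a newer writer) degrade safely.
        return RunStatus.UNKNOWN
```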

Nikita Vemuri added 2 commits November 13, 2024 12:52
Signed-off-by: Nikita Vemuri <nikitavemuri@anyscale.com>
@justinvyu justinvyu self-assigned this Nov 14, 2024
Contributor

@hongpeng-guo hongpeng-guo left a comment


cc @nikitavemuri Thanks for persisting the train dashboard. This is a very clean PR. Left one comment above. We may have some offline discussions, as well 😄


@stale

stale bot commented Feb 1, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Feb 1, 2025
matthewdeng added a commit that referenced this pull request Mar 4, 2025
This PR adds Export API support for Ray Train state events.

## Key Changes

- Added new proto messages `ExportTrainRunEventData` and
`ExportTrainRunAttemptEventData` to capture training state
- Created `EventLogType` enum to manage different types of export event
logs
- Updated `TrainStateActor` to export Train state events when export is
enabled
- Modified timestamp fields from milliseconds to nanoseconds (for both
proto and python schema)
  - `start_time_ms` → `start_time_ns`
  - `end_time_ms` → `end_time_ns`

## Implementation Details

- Train run and attempt events are now written to the
`event_EXPORT_TRAIN_STATE.log` log file when the export API is enabled
- Export can be enabled either globally or specifically for Train events
using environment variables:
  - `RAY_enable_export_api_write=1` (all events)
- `RAY_enable_export_api_write_config=EXPORT_TRAIN_RUN` (Train run
events only)
- `RAY_enable_export_api_write_config=EXPORT_TRAIN_RUN_ATTEMPT` (Train
run attempt events only)

Based off of #47888.
Follows the new schema added in
#50515.

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Alan Guo <aguo@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
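As a rough illustration of the gating described in the commit message above, a simplified check might look like the following. This is a sketch that assumes a comma-separated config value, not the actual Ray implementation:

```python
import os


def is_export_enabled(source_type: str) -> bool:
    """Return True if export events of `source_type` should be written.

    Export is enabled either globally (RAY_enable_export_api_write=1) or for
    specific sources listed in RAY_enable_export_api_write_config, per the
    flags described above. The comma-separated parsing here is an assumption.
    """
    if os.environ.get("RAY_enable_export_api_write") == "1":
        return True
    config = os.environ.get("RAY_enable_export_api_write_config", "")
    enabled = {s.strip() for s in config.split(",") if s.strip()}
    return source_type in enabled
```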
abrarsheikh pushed a commit that referenced this pull request Mar 8, 2025
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
@stale

stale bot commented Apr 25, 2025

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Apr 25, 2025