[train] Add SHUTTING_DOWN TrainControllerState and improve logging by TimothySeah · Pull Request #57882 · ray-project/ray

TimothySeah · 2025-10-18T19:32:24Z

Summary

The crux of the issue is that in the past, train run status was synonymous with final worker group status, but now, when there are pending validations, the worker group is finished but the train run is not. This leads to confusing situations in which the Train Run is FINISHED, but because there are pending validations, the controller actor is alive and results are inaccessible.

This PR:

* Adds a new SHUTTING_DOWN TrainControllerState that happens after the worker group finishes but before the controller shuts everything down.
* Makes ValidationManager logging slightly cleaner.

Like RESCHEDULING, SHUTTING_DOWN is a hidden state that shows up in StateManager logs and Grafana but not in the state export. We only want to show terminal states in the state export after fit() has returned and results are accessible. More concretely:

* Finished/errored: The worker group finishes (Train Run is RUNNING but internal state is SHUTTING_DOWN), validation finishes (both Train Run and internal state say FINISHED or ERRORED), then results are accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight validation tasks canceled before the Train Run is ABORTED. However, this PR doesn't change the current behavior, in which the Train Run might be ABORTED before reference counting cleans up the validation tasks. I will cancel validation tasks before marking the train run ABORTED in a future PR.
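
The split between the internal controller state and the exported Train Run status can be sketched roughly as follows. This is an illustration only, not the code in this PR: `InternalState`, `RunStatus`, `HIDDEN_STATES`, and `StatusExporter` are hypothetical names, and the sketch assumes that a hidden state simply leaves the last exported status unchanged.

```python
from enum import Enum


class InternalState(Enum):
    """Hypothetical stand-in for the controller's internal states."""

    RUNNING = "RUNNING"
    RESCHEDULING = "RESCHEDULING"
    SHUTTING_DOWN = "SHUTTING_DOWN"
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"
    ABORTED = "ABORTED"


class RunStatus(Enum):
    """Hypothetical stand-in for the status shown in the state export."""

    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    ERRORED = "ERRORED"
    ABORTED = "ABORTED"


# Hidden states appear in logs and metrics but never reach the export.
HIDDEN_STATES = {InternalState.RESCHEDULING, InternalState.SHUTTING_DOWN}


class StatusExporter:
    """Tracks the exported status, ignoring hidden internal states."""

    def __init__(self) -> None:
        # Assumption: the run has already started when tracking begins.
        self.exported = RunStatus.RUNNING

    def on_state_change(self, state: InternalState) -> RunStatus:
        if state in HIDDEN_STATES:
            # Assumed behavior: no export update for hidden states.
            return self.exported
        self.exported = RunStatus[state.name]
        return self.exported


exporter = StatusExporter()
for state in (
    InternalState.RUNNING,
    InternalState.SHUTTING_DOWN,
    InternalState.FINISHED,
):
    print(f"{state.name:>13} -> exported {exporter.on_state_change(state).name}")
```

Under these assumptions, the exported status stays RUNNING while the internal state is SHUTTING_DOWN and only becomes terminal once the controller itself reaches a terminal state.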

I considered polling both the worker group and validations in _step itself, but decided to leave _step as a function that only cares about the worker group.
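
One way to picture this design choice is a control loop in which `_step` looks only at the worker group, while the surrounding loop iteration decides when pending validations allow a terminal state. The sketch below is a simplified assumption, not the actual controller: `SketchController`, `worker_group_done`, and `validations_pending` are hypothetical, and only the names `_step` and `run_control_loop_iteration` come from this PR's text and commit messages.

```python
from enum import Enum
from typing import Callable, List


class State(Enum):
    RUNNING = "RUNNING"
    SHUTTING_DOWN = "SHUTTING_DOWN"
    FINISHED = "FINISHED"


class SketchController:
    """Hypothetical controller; not Ray's TrainController."""

    def __init__(
        self,
        worker_group_done: Callable[[], bool],
        validations_pending: Callable[[], int],
    ) -> None:
        self._worker_group_done = worker_group_done
        self._validations_pending = validations_pending
        self.state = State.RUNNING
        self.history: List[State] = [State.RUNNING]

    def _step(self) -> State:
        # Only cares about the worker group, never about validations.
        if self.state is State.RUNNING and self._worker_group_done():
            return State.SHUTTING_DOWN
        return self.state

    def run_control_loop_iteration(self) -> None:
        if self.state is State.SHUTTING_DOWN:
            # Validations are checked here, outside of _step.
            if self._validations_pending() == 0:
                next_state = State.FINISHED
            else:
                next_state = State.SHUTTING_DOWN
        else:
            next_state = self._step()
        if next_state is not self.state:
            self.state = next_state
            self.history.append(next_state)


# Simulated run: the worker group is already done, and two validations
# drain over successive iterations.
pending_counts = [2, 1, 0]
controller = SketchController(
    worker_group_done=lambda: True,
    validations_pending=lambda: pending_counts.pop(0),
)
while controller.state is not State.FINISHED:
    controller.run_control_loop_iteration()
print([s.name for s in controller.history])
```

Keeping the validation check out of `_step` means the worker-group transition logic stays unchanged, and the new intermediate state is the only place that knows about validations.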

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a new SHUTTING_DOWN state to the TrainController to better manage the shutdown sequence, particularly when asynchronous validations are pending. This is a thoughtful architectural improvement that enhances the robustness of the training lifecycle. The implementation, including the necessary adjustments to the controller logic and tests, is well-executed. I have identified a couple of minor issues in state.py—a typo in a docstring and an incorrect type hint—which I've detailed in the review comments. Overall, this is a valuable contribution.

[train] Add SHUTTING_DOWN TrainControllerState and improve logging

273a772

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from a team as a code owner October 18, 2025 19:32

gemini-code-assist bot reviewed Oct 18, 2025

python/ray/train/v2/_internal/execution/controller/state.py (outdated, resolved)

Update python/ray/train/v2/_internal/execution/controller/state.py

8d570c4

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Timothy Seah <tseah@anyscale.com>

ray-gardener bot added the "train" label (Ray Train Related Issue) Oct 19, 2025

matthewdeng reviewed Oct 20, 2025

python/ray/train/v2/_internal/execution/controller/state.py (outdated, resolved)

python/ray/train/v2/_internal/execution/controller/controller.py (outdated, resolved)

python/ray/train/v2/_internal/callbacks/state_manager.py (outdated, resolved)

address comments

90a1733

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah added 2 commits October 21, 2025 15:57

more cleanup

7c204b8

Signed-off-by: Timothy Seah <tseah@anyscale.com>

move logic to run_control_loop_iteration

9131b99

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from matthewdeng October 21, 2025 23:27

TimothySeah added 2 commits October 21, 2025 16:43

clean up comment

a063fc9

Signed-off-by: Timothy Seah <tseah@anyscale.com>

remove is_hidden attribute

1420fe3

Signed-off-by: Timothy Seah <tseah@anyscale.com>

matthewdeng added the "go" label (add ONLY when ready to merge, run all tests) Oct 24, 2025

matthewdeng approved these changes Oct 24, 2025

matthewdeng enabled auto-merge (squash) October 24, 2025 23:14

matthewdeng merged commit 3b22b40 into ray-project:master Oct 24, 2025
8 checks passed

TimothySeah mentioned this pull request Dec 4, 2025

[train] Only kick off 1 validation at a time #59128

Merged

matthewdeng merged 7 commits into ray-project:master from TimothySeah:tseah/train-run-finished-after-validations

2 participants
