Skip to content

[train] Add SHUTTING_DOWN TrainControllerState and improve logging#57882

Merged
matthewdeng merged 7 commits intoray-project:masterfrom
TimothySeah:tseah/train-run-finished-after-validations
Oct 24, 2025
Merged

[train] Add SHUTTING_DOWN TrainControllerState and improve logging#57882
matthewdeng merged 7 commits intoray-project:masterfrom
TimothySeah:tseah/train-run-finished-after-validations

Conversation

@TimothySeah
Copy link
Copy Markdown
Contributor

@TimothySeah TimothySeah commented Oct 18, 2025

Summary

The crux of the issue is that in the past, train run status was synonymous with final worker group status, but now, when there are pending validations, the worker group is finished but the train run is not. This leads to confusing situations in which the Train Run is FINISHED, but because there are pending validations, the controller actor is alive and results are inaccessible.

This PR:

  • Adds a new SHUTTING_DOWN TrainControllerState that happens after the worker group finishes but before the controller shuts everything down.
  • Makes ValidationManager logging slightly cleaner.

Like RESCHEDULING, SHUTTING_DOWN is a hidden state that shows up in StateManager logs and Grafana but not in the state export. We only want to show terminal states in the state export after fit() has returned and results are accessible. More concretely:

  • Finished/errored: The worker group finishes (Train Run is RUNNING but internal state is SHUTTING_DOWN), validation finishes (both Train Run and internal state say FINISHED or ERRORED), then results are accessible.
  • Aborted: Ideally, the worker group should be aborted and in-flight validation tasks canceled before the Train Run is ABORTED. However, this PR doesn't change the current behavior, in which the Train Run might be ABORTED before reference counting cleans up the validation tasks. I will cancel validation tasks before marking the train run ABORTED in a future PR.

I considered polling both the worker group and validations in _step itself, but decided to leave _step as a function that only cares about the worker group.

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner October 18, 2025 19:32
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new SHUTTING_DOWN state to the TrainController to better manage the shutdown sequence, particularly when asynchronous validations are pending. This is a thoughtful architectural improvement that enhances the robustness of the training lifecycle. The implementation, including the necessary adjustments to the controller logic and tests, is well-executed. I have identified a couple of minor issues in state.py—a typo in a docstring and an incorrect type hint—which I've detailed in the review comments. Overall, this is a valuable contribution.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Oct 19, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
cursor[bot]

This comment was marked as outdated.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@matthewdeng matthewdeng added the go add ONLY when ready to merge, run all tests label Oct 24, 2025
@matthewdeng matthewdeng enabled auto-merge (squash) October 24, 2025 23:14
@matthewdeng matthewdeng merged commit 3b22b40 into ray-project:master Oct 24, 2025
8 checks passed
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
…ay-project#57882)

# Summary

The crux of the issue is that in the past, train run status was
synonymous with final worker group status, but now, when there are
pending validations, the worker group is finished but the train run is
not. This leads to confusing situations in which the Train Run is
`FINISHED`, but because there are pending validations, the `controller`
actor is alive and results are inaccessible.

This PR:
* Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after
the worker group finishes but before the controller shuts everything
down.
* Makes `ValidationManager` logging slightly cleaner.

Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in
`StateManager` logs and Grafana but not in the state export. We only
want to show terminal states in the state export after `fit()` has
returned and results are accessible. More concretely:
* Finished/errored: The worker group finishes (Train Run is `RUNNING`
but internal state is `SHUTTING_DOWN`), validation finishes (both Train
Run and internal state say `FINISHED` or `ERRORED`), then results are
accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight
validation tasks canceled before the Train Run is `ABORTED`. However,
this PR doesn't change the current behavior, in which the Train Run
might be `ABORTED` before reference counting cleans up the validation
tasks. I will cancel validation tasks before marking the train run
`ABORTED` in a future PR.

I considered polling both the worker group and validations in `_step`
itself, but decided to leave `_step` as a function that only cares about
the worker group.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ay-project#57882)

# Summary

The crux of the issue is that in the past, train run status was
synonymous with final worker group status, but now, when there are
pending validations, the worker group is finished but the train run is
not. This leads to confusing situations in which the Train Run is
`FINISHED`, but because there are pending validations, the `controller`
actor is alive and results are inaccessible.

This PR:
* Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after
the worker group finishes but before the controller shuts everything
down.
* Makes `ValidationManager` logging slightly cleaner.
 
Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in
`StateManager` logs and Grafana but not in the state export. We only
want to show terminal states in the state export after `fit()` has
returned and results are accessible. More concretely:
* Finished/errored: The worker group finishes (Train Run is `RUNNING`
but internal state is `SHUTTING_DOWN`), validation finishes (both Train
Run and internal state say `FINISHED` or `ERRORED`), then results are
accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight
validation tasks canceled before the Train Run is `ABORTED`. However,
this PR doesn't change the current behavior, in which the Train Run
might be `ABORTED` before reference counting cleans up the validation
tasks. I will cancel validation tasks before marking the train run
`ABORTED` in a future PR.

I considered polling both the worker group and validations in `_step`
itself, but decided to leave `_step` as a function that only cares about
the worker group.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ay-project#57882)

# Summary

The crux of the issue is that in the past, train run status was
synonymous with final worker group status, but now, when there are
pending validations, the worker group is finished but the train run is
not. This leads to confusing situations in which the Train Run is
`FINISHED`, but because there are pending validations, the `controller`
actor is alive and results are inaccessible.

This PR:
* Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after
the worker group finishes but before the controller shuts everything
down.
* Makes `ValidationManager` logging slightly cleaner.

Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in
`StateManager` logs and Grafana but not in the state export. We only
want to show terminal states in the state export after `fit()` has
returned and results are accessible. More concretely:
* Finished/errored: The worker group finishes (Train Run is `RUNNING`
but internal state is `SHUTTING_DOWN`), validation finishes (both Train
Run and internal state say `FINISHED` or `ERRORED`), then results are
accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight
validation tasks canceled before the Train Run is `ABORTED`. However,
this PR doesn't change the current behavior, in which the Train Run
might be `ABORTED` before reference counting cleans up the validation
tasks. I will cancel validation tasks before marking the train run
`ABORTED` in a future PR.

I considered polling both the worker group and validations in `_step`
itself, but decided to leave `_step` as a function that only cares about
the worker group.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#57882)

# Summary

The crux of the issue is that in the past, train run status was
synonymous with final worker group status, but now, when there are
pending validations, the worker group is finished but the train run is
not. This leads to confusing situations in which the Train Run is
`FINISHED`, but because there are pending validations, the `controller`
actor is alive and results are inaccessible.

This PR:
* Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after
the worker group finishes but before the controller shuts everything
down.
* Makes `ValidationManager` logging slightly cleaner.

Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in
`StateManager` logs and Grafana but not in the state export. We only
want to show terminal states in the state export after `fit()` has
returned and results are accessible. More concretely:
* Finished/errored: The worker group finishes (Train Run is `RUNNING`
but internal state is `SHUTTING_DOWN`), validation finishes (both Train
Run and internal state say `FINISHED` or `ERRORED`), then results are
accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight
validation tasks canceled before the Train Run is `ABORTED`. However,
this PR doesn't change the current behavior, in which the Train Run
might be `ABORTED` before reference counting cleans up the validation
tasks. I will cancel validation tasks before marking the train run
`ABORTED` in a future PR.

I considered polling both the worker group and validations in `_step`
itself, but decided to leave `_step` as a function that only cares about
the worker group.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ay-project#57882)

# Summary

The crux of the issue is that in the past, train run status was
synonymous with final worker group status, but now, when there are
pending validations, the worker group is finished but the train run is
not. This leads to confusing situations in which the Train Run is
`FINISHED`, but because there are pending validations, the `controller`
actor is alive and results are inaccessible.

This PR:
* Adds a new `SHUTTING_DOWN` `TrainControllerState` that happens after
the worker group finishes but before the controller shuts everything
down.
* Makes `ValidationManager` logging slightly cleaner.

Like `RESCHEDULING`, `SHUTTING_DOWN` is a hidden state that shows up in
`StateManager` logs and Grafana but not in the state export. We only
want to show terminal states in the state export after `fit()` has
returned and results are accessible. More concretely:
* Finished/errored: The worker group finishes (Train Run is `RUNNING`
but internal state is `SHUTTING_DOWN`), validation finishes (both Train
Run and internal state say `FINISHED` or `ERRORED`), then results are
accessible.
* Aborted: Ideally, the worker group should be aborted and in-flight
validation tasks canceled before the Train Run is `ABORTED`. However,
this PR doesn't change the current behavior, in which the Train Run
might be `ABORTED` before reference counting cleans up the validation
tasks. I will cancel validation tasks before marking the train run
`ABORTED` in a future PR.

I considered polling both the worker group and validations in `_step`
itself, but decided to leave `_step` as a function that only cares about
the worker group.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants