Skip to content

[train] Driver SIGINT calls controller abort#53600

Merged
justinvyu merged 25 commits intoray-project:masterfrom
TimothySeah:tseah/async-controller
Jun 25, 2025
Merged

[train] Driver SIGINT calls controller abort#53600
justinvyu merged 25 commits intoray-project:masterfrom
TimothySeah:tseah/async-controller

Conversation

@TimothySeah
Copy link
Copy Markdown
Contributor

@TimothySeah TimothySeah commented Jun 6, 2025

Summary

The goal of this PR is to make it so that when users Ctrl C their ray train driver script, we gracefully terminate the run and mark train runs and train run attempts as ABORTED.

Implementation Details

Here are the changes to each component:

  • driver: sigint handler that calls controller.abort in a blocking fashion
  • controller/worker_group: mark train run and train run attempts as aborted and exit
  • workers: no change - for now we rely on ray's object reference counting to clean up workers. This is safe because the workers cannot modify train run and train run attempt state if the controller has exited. We can consider gracefully terminating the workers in the future but we don't for now since there is some risk of hanging e.g. when we call destroy_process_group on an active group.

We decided to implement abort as an async method instead of a separate thread to avoid race conditions when setting state. See the diagrams below.

Screenshot 2025-06-12 at 1 30 23 PM Screenshot 2025-06-12 at 1 30 29 PM

Testing

Here is my workspace

When I Ctrl C my Ray Train run I see these logs.

which results in this train run page.

Screenshot 2025-06-09 at 11 08 22 PM

Regular training runs still work - here is the overall ray train dashboard page the aborted run and a successful run.

Screenshot 2025-06-09 at 11 08 09 PM

Misc notes

  • I moved controller._start from run to __init__ because the new abort method calls before_controller_abort which assumes that after_controller_start was called; for example, the state manager callback can only mark a run as aborted after it was already created.
  • I made the driver sigint handler catch the ActorDiedError and sys.exit(0) to indicate that this is expected behavior. However, even after doing this, the driver exits with exit code 1. Let me know if this is ok.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
…runs"

This reverts commit 1fb936e.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah marked this pull request as ready for review June 10, 2025 00:17
@TimothySeah TimothySeah requested a review from a team as a code owner June 10, 2025 00:17
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write about the async controller design choice (maybe with a diagram of how multiple actor methods share time), compared to the other choices (running the control loop as a thread)?

Also, this is relevant: #53169

  • Even if the driver exits ungracefully, Ray Core could enforce that __del__ gets called on the actor process when it gets garbage collected).

# TODO: These can be attributes of a RunAttempt?
self._latest_poll_time = float("-inf")

self._start()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change makes sense. This way, we always trigger the callback start hooks before the abort/shutdown hooks.

Although, I think it's also fine if the user aborts the run really quickly and the run never gets registered with the dashboard.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the corner case in which StateManagerCallback.after_controller_start hasn't been called yet, in which case before_controller_abort will fail because run_id doesn't exist yet. Lmk if this is fine.

Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>
@TimothySeah
Copy link
Copy Markdown
Contributor Author

Can you write about the async controller design choice (maybe with a diagram of how multiple actor methods share time), compared to the other choices (running the control loop as a thread)?

Sure - should that go in the PR description or in the original google design doc?

@justinvyu
Copy link
Copy Markdown
Contributor

@TimothySeah Design doc is good. But let's also paste it here so we have a reference for this design decision in the future when look back

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah
Copy link
Copy Markdown
Contributor Author

@TimothySeah Design doc is good. But let's also paste it here so we have a reference for this design decision in the future when look back

Added to both PR description and design doc.

…ring

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Jun 13, 2025
…ents

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from justinvyu June 15, 2025 03:15
@TimothySeah
Copy link
Copy Markdown
Contributor Author

Confirmed still works

Screenshot 2025-06-14 at 9 42 13 PM
Screenshot 2025-06-14 at 9 42 27 PM

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from matthewdeng June 17, 2025 00:50
Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I think this is really close! One question to discuss before approving.

Can you also add a summary in the PR description?

@TimothySeah
Copy link
Copy Markdown
Contributor Author

Can you also add a summary in the PR description?

Done - also made other changes to the pr description.

Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment on lines +218 to +221
# We catch the error and exit 0 to indicate graceful termination.
# However, for some reason the process still exits with 1.
sys.exit(0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want this? Should it just pass over the error and exit with code 130 (interrupted)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an explanation here: #53600 (comment). Open to suggestions though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh got it, I think in conjunction with the info log it should good.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Comment on lines 217 to 219
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Received SIGINT. Gracefully stopping the training run — this may take a few seconds.
To forcefully terminate immediately, you can send a different signal, such as SIGKILL.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used your suggested wording but still said abort/aborting instead of stopping/terminate to be consistent with the aborted state. Lmk what you think.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@justinvyu justinvyu merged commit 8af8dae into ray-project:master Jun 25, 2025
5 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
The goal of this PR is to make it so that when users Ctrl C their ray
train driver script, we gracefully terminate the run and mark train runs
and train run attempts as `ABORTED`.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
The goal of this PR is to make it so that when users Ctrl C their ray
train driver script, we gracefully terminate the run and mark train runs
and train run attempts as `ABORTED`.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants