[train] Driver SIGINT calls controller abort by TimothySeah · Pull Request #53600 · ray-project/ray

TimothySeah · 2025-06-06T03:22:09Z

Summary

When a user presses Ctrl+C on their Ray Train driver script, this PR gracefully terminates the run and marks the train run and its train run attempts as ABORTED.

Implementation Details

Here are the changes to each component:

driver: a SIGINT handler that calls controller.abort and blocks until it returns
controller/worker_group: mark the train run and its train run attempts as aborted, then exit
workers: no change. For now we rely on Ray's object reference counting to clean up workers. This is safe because the workers cannot modify train run or train run attempt state once the controller has exited. We may terminate the workers gracefully in the future, but we don't for now because there is some risk of hanging, e.g. when we call destroy_process_group on an active group.
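
As a rough illustration of the driver-side change, here is a minimal sketch of a SIGINT handler that aborts before exiting. It uses only the standard library; `ToyController`, `make_sigint_handler`, and `install_sigint_handler` are hypothetical names for this example and are not the PR's actual code, where the controller is a Ray actor and the abort call is a blocking remote call.

```python
import signal
import sys


class ToyController:
    """Hypothetical stand-in for the controller actor (not Ray Train's class)."""

    def __init__(self):
        self.status = "RUNNING"

    def abort(self):
        # In the PR, this is where the run and its attempts get marked ABORTED.
        self.status = "ABORTED"


def make_sigint_handler(controller):
    """Build a handler that aborts the controller, then exits the driver."""

    def handler(signum, frame):
        # Call abort synchronously so state is recorded before the driver exits.
        controller.abort()
        sys.exit(0)

    return handler


def install_sigint_handler(controller):
    """Register the handler; must be called from the main thread."""
    return signal.signal(signal.SIGINT, make_sigint_handler(controller))
```

Calling abort synchronously inside the handler is what keeps the driver alive until the abort has completed.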

We decided to implement abort as an async method instead of a separate thread to avoid race conditions when setting state. See the diagrams below.
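
To make the async choice concrete, here is a small sketch using plain asyncio rather than Ray. It assumes a control loop and an abort method that share one event loop, so abort can only run at an `await` point of the loop and never in the middle of a step. `ToyController` and `main` are hypothetical names for illustration only.

```python
import asyncio


class ToyController:
    """Hypothetical controller used only to illustrate the scheduling model."""

    def __init__(self):
        self.status = "RUNNING"
        self.steps = 0

    async def run(self):
        # Control loop: other coroutines can only run at the await below.
        while self.status == "RUNNING":
            self.steps += 1
            await asyncio.sleep(0)  # yield point where abort() may be scheduled
        return self.status

    async def abort(self):
        # Runs on the same event loop as run(), so no lock is needed here.
        self.status = "ABORTED"


async def main():
    controller = ToyController()
    run_task = asyncio.create_task(controller.run())
    await asyncio.sleep(0)  # let the control loop take a turn
    await controller.abort()
    return await run_task, controller.steps
```

With a separate thread, the status write in abort could interleave with a step of the control loop at any instruction, which is the kind of race the async design avoids.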

Testing

Here is my workspace

When I Ctrl C my Ray Train run I see these logs.

which results in this train run page.

Regular training runs still work: the overall Ray Train dashboard page shows both the aborted run and a successful run.

Misc notes

I moved controller._start from run to __init__ because the new abort method calls before_controller_abort, which assumes that after_controller_start has already been called. For example, the state manager callback can only mark a run as aborted after the run has been created.
I made the driver SIGINT handler catch ActorDiedError and call sys.exit(0) to indicate that this is expected behavior. However, even with this change, the driver still exits with exit code 1. Let me know if this is OK.
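
The hook-ordering point in the first note can be sketched as follows. This is a simplified model, not the PR's code: the hook names `after_controller_start` and `before_controller_abort` come from the description above, while `ToyStateManagerCallback`, `ToyController`, and the `"run-1"` id are hypothetical.

```python
class ToyStateManagerCallback:
    """Hypothetical callback that tracks a single run's state."""

    def __init__(self):
        self.run_id = None
        self.status = None

    def after_controller_start(self):
        # Create the run record; abort can only update a run that exists.
        self.run_id = "run-1"
        self.status = "RUNNING"

    def before_controller_abort(self):
        if self.run_id is None:
            raise RuntimeError("run was never created; nothing to abort")
        self.status = "ABORTED"


class ToyController:
    """Hypothetical controller that fires start hooks at construction time."""

    def __init__(self, callbacks):
        self._callbacks = callbacks
        # Start hooks fire here, so abort() is safe any time after __init__.
        self._start()

    def _start(self):
        for callback in self._callbacks:
            callback.after_controller_start()

    def abort(self):
        for callback in self._callbacks:
            callback.before_controller_abort()
```

If `_start` were only called from `run`, an abort arriving before `run` would hit the `RuntimeError` branch in this model.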

Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu

Can you write about the async controller design choice (maybe with a diagram of how multiple actor methods share time), compared to the other choices (running the control loop as a thread)?

Also, this is relevant: #53169

(Even if the driver exits ungracefully, Ray Core could enforce that __del__ gets called on the actor process when it gets garbage collected.)

python/ray/train/v2/_internal/execution/callback.py

justinvyu · 2025-06-10T23:42:54Z

python/ray/train/v2/_internal/execution/controller/controller.py

        # TODO: These can be attributes of a RunAttempt?
        self._latest_poll_time = float("-inf")

+        self._start()


I think this change makes sense. This way, we always trigger the callback start hooks before the abort/shutdown hooks.

Although, I think it's also fine if the user aborts the run really quickly and the run never gets registered with the dashboard.

This is for the corner case in which StateManagerCallback.after_controller_start hasn't been called yet, in which case before_controller_abort will fail because run_id doesn't exist yet. Lmk if this is fine.

python/ray/train/v2/_internal/execution/controller/controller.py

TimothySeah · 2025-06-11T02:06:06Z

Can you write about the async controller design choice (maybe with a diagram of how multiple actor methods share time), compared to the other choices (running the control loop as a thread)?

Sure - should that go in the PR description or in the original google design doc?

justinvyu · 2025-06-11T18:36:30Z

@TimothySeah Design doc is good. But let's also paste it here so we have a reference for this design decision in the future when we look back

python/ray/train/v2/_internal/execution/callback.py

python/ray/train/v2/api/data_parallel_trainer.py

TimothySeah · 2025-06-12T20:37:45Z

@TimothySeah Design doc is good. But let's also paste it here so we have a reference for this design decision in the future when we look back

Added to both PR description and design doc.

TimothySeah · 2025-06-13T20:35:18Z

Confirmed still works as expected: https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_92c7b71w55flm6gv6imv4m6vqg/workspaces/expwrk_l5azm3lr5q1urw63wzp4bh7s7n/train?workspace-tab=ray-turbo-dashboard&command-history-section=application_logs&raySession=all

python/ray/train/v2/_internal/callbacks/state_manager.py

python/ray/train/v2/tests/test_controller.py

python/ray/train/v2/_internal/callbacks/state_manager.py

python/ray/train/v2/api/data_parallel_trainer.py

TimothySeah · 2025-06-15T04:45:12Z

Confirmed still works

python/ray/train/v2/api/data_parallel_trainer.py

python/ray/train/v2/tests/test_state.py

python/ray/train/v2/tests/test_data_parallel_trainer.py

justinvyu

Great, I think this is really close! One question to discuss before approving.

Can you also add a summary in the PR description?

python/ray/train/v2/_internal/execution/worker_group/worker_group.py

python/ray/train/v2/tests/test_worker_group.py

TimothySeah · 2025-06-18T22:32:50Z

Can you also add a summary in the PR description?

Done - also made other changes to the pr description.

justinvyu

🚀

python/ray/train/v2/_internal/state/state_manager.py

matthewdeng · 2025-06-23T21:31:58Z

python/ray/train/v2/api/data_parallel_trainer.py

+                # We catch the error and exit 0 to indicate graceful termination.
+                # However, for some reason the process still exits with 1.
+                sys.exit(0)
+


Why do we want this? Should it just pass over the error and exit with code 130 (interrupted)?

I have an explanation here: #53600 (comment). Open to suggestions though.

Oh got it, I think in conjunction with the info log it should be good.
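
For reference, the 130 mentioned in this thread follows the common shell convention of reporting 128 plus the signal number for a process terminated by a signal. A tiny sketch, where `interrupted_exit_code` is a hypothetical helper and not part of the PR:

```python
import signal


def interrupted_exit_code(signum: int = signal.SIGINT) -> int:
    """Return the status shells conventionally report for death by `signum`."""
    return 128 + int(signum)
```

Since SIGINT is signal number 2, this gives 130 for Ctrl+C.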

python/ray/train/v2/_internal/execution/controller/controller.py

justinvyu · 2025-06-24T01:37:59Z

python/ray/train/v2/api/data_parallel_trainer.py

Received SIGINT. Gracefully stopping the training run — this may take a few seconds. To forcefully terminate immediately, you can send a different signal, such as SIGKILL.

Used your suggested wording but still said abort/aborting instead of stopping/terminate to be consistent with the aborted state. Lmk what you think.

The goal of this PR is to make it so that when users Ctrl C their ray train driver script, we gracefully terminate the run and mark train runs and train run attempts as `ABORTED`.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

TimothySeah added 7 commits June 5, 2025 20:21

[train] Driver SIGINT calls controller abort

c0d17a5

Signed-off-by: Timothy Seah <tseah@anyscale.com>

be consistent with sys.exit status

af3ade2

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Fix unit test + improve comment

8d27d44

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Add unit tests

d3badd5

Signed-off-by: Timothy Seah <tseah@anyscale.com>

mark controller actor id optional to enable local controller runs

1fb936e

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Revert "mark controller actor id optional to enable local controller …

7f0c977

…runs" This reverts commit 1fb936e. Signed-off-by: Timothy Seah <tseah@anyscale.com>

Remove local controller case

31a40f2

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah marked this pull request as ready for review June 10, 2025 00:17

TimothySeah requested a review from a team as a code owner June 10, 2025 00:17

TimothySeah requested review from justinvyu and matthewdeng June 10, 2025 00:17

TimothySeah added 5 commits June 9, 2025 18:01

Merge remote-tracking branch 'upstream/master' into tseah/async-contr…

4423e14

…oller

exit with ray.actor.exit_actor() instead

d0959ad

Signed-off-by: Timothy Seah <tseah@anyscale.com>

fix tune controller as driver path

c6f7eb5

Signed-off-by: Timothy Seah <tseah@anyscale.com>

try ignore reinit error

c687097

Signed-off-by: Timothy Seah <tseah@anyscale.com>

skip strange test

a74417a

Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu reviewed Jun 10, 2025

Update python/ray/train/v2/_internal/execution/controller/controller.py

7e2868d

Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>

Add todo comment

6f001e2

Signed-off-by: Timothy Seah <tseah@anyscale.com>

matthewdeng reviewed Jun 12, 2025

python/ray/train/v2/_internal/execution/callback.py (outdated)

python/ray/train/v2/api/data_parallel_trainer.py (outdated)

address pr feedback: register sigint helper function + improved docst…

4414f35

…ring Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested review from justinvyu and matthewdeng June 12, 2025 20:47

TimothySeah added the "go" label (add ONLY when ready to merge, run all tests) Jun 13, 2025

Merge branch 'master' into tseah/async-controller

077ac3a

justinvyu reviewed Jun 13, 2025

address feedback: abort with workergroupcallback, clean up tests/comm…

9f00482

…ents Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from justinvyu June 15, 2025 03:15

matthewdeng reviewed Jun 17, 2025

TimothySeah added 2 commits June 16, 2025 17:45

fix unit tests

51f53fb

Signed-off-by: Timothy Seah <tseah@anyscale.com>

address pr feedback: clean up code

5fa6c46

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from matthewdeng June 17, 2025 00:50

justinvyu reviewed Jun 17, 2025

python/ray/train/v2/_internal/execution/worker_group/worker_group.py

python/ray/train/v2/tests/test_worker_group.py

justinvyu approved these changes Jun 18, 2025

TimothySeah added 3 commits June 18, 2025 15:58

Add comment and change unit test

4e32d48

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Merge remote-tracking branch 'upstream/master' into tseah/async-contr…

8bc2925

…oller

shut down to fix unit test

2513242

Signed-off-by: Timothy Seah <tseah@anyscale.com>

matthewdeng reviewed Jun 23, 2025

TimothySeah added 2 commits June 23, 2025 16:40

swap attempt and run abort to avoid incomplete state

0d06e6b

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Add sigint handler logging so users know why exiting is so slow

614abc7

Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu reviewed Jun 24, 2025

Add comment and improve signal handling message

e66adc0

Signed-off-by: Timothy Seah <tseah@anyscale.com>

justinvyu merged commit 8af8dae into ray-project:master Jun 25, 2025
5 checks passed

TimothySeah mentioned this pull request Jun 27, 2025

[train] Make worker group start and poll async #54181

Closed
