[train] Driver SIGINT calls controller abort#53600
[train] Driver SIGINT calls controller abort#53600justinvyu merged 25 commits intoray-project:masterfrom
Conversation
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
…runs" This reverts commit 1fb936e. Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu
left a comment
There was a problem hiding this comment.
Can you write about the async controller design choice (maybe with a diagram of how multiple actor methods share time), compared to the other choices (running the control loop as a thread)?
Also, this is relevant: #53169
- Even if the driver exits ungracefully, Ray Core could enforce that
__del__gets called on the actor process when it gets garbage collected).
| # TODO: These can be attributes of a RunAttempt? | ||
| self._latest_poll_time = float("-inf") | ||
|
|
||
| self._start() |
There was a problem hiding this comment.
I think this change makes sense. This way, we always trigger the callback start hooks before the abort/shutdown hooks.
Although, I think it's also fine if the user aborts the run really quickly and the run never gets registered with the dashboard.
There was a problem hiding this comment.
This is for the corner case in which StateManagerCallback.after_controller_start hasn't been called yet, in which case before_controller_abort will fail because run_id doesn't exist yet. Lmk if this is fine.
python/ray/train/v2/_internal/execution/controller/controller.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com>
Sure - should that go in the PR description or in the original google design doc? |
|
@TimothySeah Design doc is good. But let's also paste it here so we have a reference for this design decision in the future when look back |
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Added to both PR description and design doc. |
…ring Signed-off-by: Timothy Seah <tseah@anyscale.com>
…ents Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu
left a comment
There was a problem hiding this comment.
Great, I think this is really close! One question to discuss before approving.
Can you also add a summary in the PR description?
Done - also made other changes to the pr description. |
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
| # We catch the error and exit 0 to indicate graceful termination. | ||
| # However, for some reason the process still exits with 1. | ||
| sys.exit(0) | ||
|
|
There was a problem hiding this comment.
Why do we want this? Should it just pass over the error and exit with code 130 (interrupted)?
There was a problem hiding this comment.
I have an explanation here: #53600 (comment). Open to suggestions though.
There was a problem hiding this comment.
Oh got it, I think in conjunction with the info log it should good.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
There was a problem hiding this comment.
Received SIGINT. Gracefully stopping the training run — this may take a few seconds.
To forcefully terminate immediately, you can send a different signal, such as SIGKILL.
There was a problem hiding this comment.
Used your suggested wording but still said abort/aborting instead of stopping/terminate to be consistent with the aborted state. Lmk what you think.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
The goal of this PR is to make it so that when users Ctrl C their ray train driver script, we gracefully terminate the run and mark train runs and train run attempts as `ABORTED`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com> Co-authored-by: Timothy Seah <tseah@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com>
The goal of this PR is to make it so that when users Ctrl C their ray train driver script, we gracefully terminate the run and mark train runs and train run attempts as `ABORTED`. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Timothy Seah <timothy.seah777@yahoo.com> Co-authored-by: Timothy Seah <tseah@anyscale.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>


Summary
The goal of this PR is to make it so that when users Ctrl C their ray train driver script, we gracefully terminate the run and mark train runs and train run attempts as
ABORTED.Implementation Details
Here are the changes to each component:
controller.abortin a blocking fashiondestroy_process_groupon an active group.We decided to implement
abortas an async method instead of a separate thread to avoid race conditions when setting state. See the diagrams below.Testing
Here is my workspace
When I Ctrl C my Ray Train run I see these logs.
which results in this train run page.
Regular training runs still work - here is the overall ray train dashboard page the aborted run and a successful run.
Misc notes
controller._startfromrunto__init__because the newabortmethod callsbefore_controller_abortwhich assumes thatafter_controller_startwas called; for example, the state manager callback can only mark a run as aborted after it was already created.ActorDiedErrorandsys.exit(0)to indicate that this is expected behavior. However, even after doing this, the driver exits with exit code 1. Let me know if this is ok.