[core] prevent sending SIGTERM after calling Worker::MarkDead#54377
[core] prevent sending SIGTERM after calling Worker::MarkDead#54377edoakes merged 2 commits intoray-project:masterfrom
Conversation
f9e3499 to
1eec2b8
Compare
|
Hi @codope, could you help review this? Thanks! |
codope
left a comment
There was a problem hiding this comment.
Is change in behavior strictly necessary? Previously you could distinguish between "marked for death" and "actually killing". Now we send SIGTERM to workers after failed Exit RPCs, which could be more disruptive if workers are legitimately unable to exit immediately (e.g., due to ongoing object references). I think this might be an issue for borrowers. If workers are killed forcefully, they cannot send WaitForRefRemoved messages to object owners. It could cause other workers to suddenly lose access to objects they were borrowing. In case of KillAsync, we would still want the idle workers to allow to drain their references properly. I am wondering if we should retain MarkDead() for graceful shutdown and KillAsync() only for truly unresponsive workers.
src/ray/raylet/worker.cc
Outdated
There was a problem hiding this comment.
Let's add a doc to clarify the new semantics of IsDead()
src/ray/raylet/worker_pool.cc
Outdated
There was a problem hiding this comment.
Do we have tests to validate the two scenarios mentioned in the comment?
src/ray/raylet/worker_pool_test.cc
Outdated
There was a problem hiding this comment.
All these tests are doing force kill. Let's also cover when force is false.
src/ray/raylet/worker_pool_test.cc
Outdated
There was a problem hiding this comment.
Do we need to call TryKillingIdleWorkers if force kill is true?
|
Thanks for the input @codope. It is super clear now that we should not send SIGTERM to workers after the
Do you mean that we should also not send SIGTERM in the case of |
Yes, I think we should not send SIGTERM in the |
1eec2b8 to
d75bfb6
Compare
Signed-off-by: Rueian <rueian@anyscale.com>
d75bfb6 to
5da4d13
Compare
Ok! Now, I only merge the flags but keep the Just curious, shouldn't the SIGTERM handling in the core worker do nothing after the |
codope
left a comment
There was a problem hiding this comment.
@rueian Thanks for addressing the comments. Looks good to me.
To answer your question, SIGTERM doesn't run on the c++ io service event loop --
ray/src/ray/core_worker/core_worker.cc
Line 1291 in 51235c4
It's handled in the main thread.
…oject#54377) ## Why are these changes needed? Following [the previous discussion](ray-project#54068 (comment)), this PR merges the flags used by `Worker::MarkDead` and `Worker::KillAsync`. Currently, `Worker::MarkDead` will only be called after we ask the worker to shutdown via the `Exit` RPC. After the `Exit` RPC, the worker should shut down by itself. This change essentially prevents sending `SIGTERM` from the node manager to a worker once its `Worker::MarkDead` method has been called, ensuring that we don't interrupt its shutdown process. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Rueian <rueian@anyscale.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
…oject#54377) ## Why are these changes needed? Following [the previous discussion](ray-project#54068 (comment)), this PR merges the flags used by `Worker::MarkDead` and `Worker::KillAsync`. Currently, `Worker::MarkDead` will only be called after we ask the worker to shutdown via the `Exit` RPC. After the `Exit` RPC, the worker should shut down by itself. This change essentially prevents sending `SIGTERM` from the node manager to a worker once its `Worker::MarkDead` method has been called, ensuring that we don't interrupt its shutdown process. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Rueian <rueian@anyscale.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Krishna Kalyan <krishnakalyan3@gmail.com>
…oject#54377) ## Why are these changes needed? Following [the previous discussion](ray-project#54068 (comment)), this PR merges the flags used by `Worker::MarkDead` and `Worker::KillAsync`. Currently, `Worker::MarkDead` will only be called after we ask the worker to shutdown via the `Exit` RPC. After the `Exit` RPC, the worker should shut down by itself. This change essentially prevents sending `SIGTERM` from the node manager to a worker once its `Worker::MarkDead` method has been called, ensuring that we don't interrupt its shutdown process. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Rueian <rueian@anyscale.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
…oject#54377) ## Why are these changes needed? Following [the previous discussion](ray-project#54068 (comment)), this PR merges the flags used by `Worker::MarkDead` and `Worker::KillAsync`. Currently, `Worker::MarkDead` will only be called after we ask the worker to shutdown via the `Exit` RPC. After the `Exit` RPC, the worker should shut down by itself. This change essentially prevents sending `SIGTERM` from the node manager to a worker once its `Worker::MarkDead` method has been called, ensuring that we don't interrupt its shutdown process. ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Rueian <rueian@anyscale.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Why are these changes needed?
Following the previous discussion, this PR merges the flags used by
Worker::MarkDeadandWorker::KillAsync.Currently,
Worker::MarkDeadwill only be called after we ask the worker to shutdown via theExitRPC. After theExitRPC, the worker should shut down by itself.This change essentially prevents sending
SIGTERMfrom the node manager to a worker once itsWorker::MarkDeadmethod has been called, ensuring that we don't interrupt its shutdown process.Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.