[Core] Fix placement group leaks #42942
Conversation
can we have a unit test, e.g. start a pg and kill it right away, and see if there's any leak?
@rynewang I am figuring that out. This requires very subtle timing (I could only repro it when I created many pgs), so writing a unit test is not trivial. Writing a C++ unit test is even harder because of how our code is structured.
// Retry 10 * worker registration timeout to avoid a race condition.
// See https://github.com/ray-project/ray/pull/42942 for more details.
/*max_retry*/ RayConfig::instance().worker_register_timeout_seconds() * 10,
Q: should we do exponential backoff? Right now, we retry every 1 second.
@jjyao it is ready to be reviewed
Q: Right now, we retry for 10 minutes with a 1 second interval. Should we
When can it happen?
When a worker is not started yet!
const rpc::CancelResourceReserveReply &reply) {
  RAY_LOG(DEBUG) << "Finished cancelling the resource reserved for bundle: "
                 << bundle_spec->DebugString() << " at node " << node_id;
[this, bundle_spec, node_id, node, max_retry, current_retry_cnt](
Where do you use the max retry?
oops.. let me fix this. I will also add a unit test
void ClusterResourceManager::AddOrUpdateNode(scheduling::NodeID node_id,
                                             const NodeResources &node_resources) {
  RAY_LOG(DEBUG) << "Update node info, node_id: " << node_id.ToInt()
This seems useless and too verbose. I can bring it back if you think it is necessary
@jjyao it'd be great if you can approve the PR. I will add changes and merge it
Signed-off-by: Ratnopam Chakrabarti <ratnopamc@yahoo.com>
Why are these changes needed?
Root cause: when we schedule a task, we allocate resources and start a worker. If a cancel bundle request is received before the worker has started, the resources leak, because cancel bundle only kills workers and returns resources from workers that have already started.
This PR fixes the issue by retrying the cancellation. This also means that if a worker starts late (there is a 60 second registration timeout), the retry can still fail, since there is a maximum number of retries. However, we retry for a very long time (10 * the registration timeout), so that is unlikely to happen.
Alternatively, to improve consistency, we could register the removed pg and keep deleting its resources (with a reconciler) until it is fully gone, or register leased workers before the workers are started. That could be a better solution, but the implications of such a change are unknown.
I chose the current solution as it is also needed for a network issue.
Related issue number
Closes #26761
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
If I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.