[PlacementGroup]Fix the bug that actor does not restart#13321
[PlacementGroup]Fix the bug that actor does not restart#13321ffbin wants to merge 1 commit intoray-project:masterfrom
Conversation
|
I think we discussed this before, but this is intentional. Imagine your placement group is removed, and that destroys the actor. In this case, restarting actor doesn't mean anything because it can never be scheduled. |
If GCS restarts during PG creation, GCS will release useless bundles, which will cause actor not to restart. |
|
Hmm I dont' think I follow this sentence; If GCS restarts during PG creation, GCS will release useless bundles, which will cause actor not to restart. But regardless, this change is dangerous. This will make the user experience really bad (because whenever user removes the placement group, they will get tons of spam messages about unintentional worker crashes) |
|
why is the actor scheduled on a useless bundle in the first place? |
PG and actor are created asynchronously. If GCS restarts in the process of creating a PG, in order to avoid PG bundle resource leakage, GCS will inform raylet to release the bundles not recorded in GCS. However, the bundle may already be used by an actor, so canceling the bundle will cause the actor to exit internally and not restart again. |
|
It is still unacceptable to have this change because fate-sharing is a feature of pg, and it is definitely intentional (and otherwise we will spam users with wrong info). What about this? We modify the remove bundle signature to have “intentional_bundle_removal” default true, and when we delete bundles cuz of pg fault tolernace protocol, we set this false? |
|
@rkooo567 #13320 and this PR will cause the following problems: detached actor A creates PG and uses PG to create detached actor B. If actor A restarts, it will cause PG to be removed, and PG will kill actor B when it is deleted. Since actor B is marked as an internal exit, actor B will not be restarted. |
SGTM, My current implementation is a temporary solution, mainly to quickly solve the problems mentioned above. |
|
Hi @rkooo567 https://github.com/oliverhu/ray/pull/1/files#diff-cdc6bf2bb23d0a02cc86196ff825c08b13cb3ac312a22796488455dfe0029e55R176 Does the above scheme depend on this PR? cc @oliverhu |
|
I think #13527 will fix this bug, so close this PR. |
Why are these changes needed?
If an actor uses a placeemt group bundle, the actor will not restart when the bundle is cancelled.
Related issue number
Checks
scripts/format.shto lint the changes in this PR.