Remove deadlock when deploying PCS with ComputeDomain #215
for pods belonging to scaled pod gangs. Previously the check was for ready pods. This change is required to remove the deadlock when using ComputeDomain (DRA).
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Regenerated the API docs and CRDs
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
I know that it solves an important deadlock issue, but I thought that waiting for the base pod gang to be ready was needed before launching the scaled pod gang.
The original thinking was that gang-scheduled pods from the base pod gang needed to become ready first, which would have ensured that the scheduled pods avoided any faulty GPUs. However, that reasoning is too conservative, since scheduling onto bad GPUs is essentially a concurrency issue between fault detection and the scheduler binding pods. While it is possible, the chances of that happening are slim, since the scheduler updates its own cache of the healthy pool of resources on every scheduling cycle. So the new reasoning is to delegate any such fault handling to the termination workflow. This PR therefore not only fixes the deadlock issue but also ensures correct behavior when faults are encountered concurrently with scheduling.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When a PCS is deployed with a DRA ComputeDomain on NVL72, it enters a deadlock if `minAvailable` < `replicas` for a PCSG.

Consider the following scenario:
* `ComputeDomain` with numNodes = 18
* `minAvailable` < `replicas` for the prefill and/or decode PCSGs.

Let's assume that you have a total of 12 pods (across the prefill and decode PCSGs) that are part of the base pod gang. These have a reference to the ComputeDomain. The Grove operator first tries to schedule and start the base pod gang. Only when the base pod gang is ready does it lift the scheduling gates for the pods belonging to the scaled pod gangs.

When the base pod gang pods are started, they do not come up, because `ComputeDomain.Status.Nodes` lists only 12 nodes and `ComputeDomain.Status.Status` is set to `NotReady`: it expects 18 nodes and there are currently only 12. Because of the scheduling gates on the remaining 6 pods (which belong to the scaled pod gang), kube-scheduler (or an equivalent) never picks these pods for scheduling.

As a result, you end up in a deadlock: the base pod gang pods never start because the ComputeDomain resource they depend on never becomes `Ready`, and since the base pod gang pods do not start, the scaled pod gang pods remain schedule gated.

This PR resolves the deadlock by changing the predicate for removing the scheduling gates from pods in scaled pod gangs. Instead of waiting for the base pod gang pods to be ready (num readyPods >= minAvailable), it now only waits for them to be scheduled (num scheduledPods >= minAvailable).
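
The predicate change can be illustrated with a minimal Go sketch. It is not the operator's actual code: the `podState` type and the function names `gatesLiftedOnReady` and `gatesLiftedOnScheduled` are hypothetical, and treating 12 as a single aggregate `minAvailable` for the whole base pod gang is a simplifying assumption made for the example.

```go
package main

import "fmt"

// podState is a hypothetical, simplified view of a base pod gang pod.
type podState struct {
	scheduled bool // the pod has been bound to a node
	ready     bool // the pod is ready (in the scenario this needs a Ready ComputeDomain)
}

// countPods returns how many pods satisfy pred.
func countPods(pods []podState, pred func(podState) bool) int {
	n := 0
	for _, p := range pods {
		if pred(p) {
			n++
		}
	}
	return n
}

// gatesLiftedOnReady models the previous predicate:
// num readyPods >= minAvailable.
func gatesLiftedOnReady(basePods []podState, minAvailable int) bool {
	ready := countPods(basePods, func(p podState) bool { return p.ready })
	return ready >= minAvailable
}

// gatesLiftedOnScheduled models the new predicate:
// num scheduledPods >= minAvailable.
func gatesLiftedOnScheduled(basePods []podState, minAvailable int) bool {
	scheduled := countPods(basePods, func(p podState) bool { return p.scheduled })
	return scheduled >= minAvailable
}

func main() {
	// Scenario from the description: 12 base pods are scheduled, but none
	// become ready because the ComputeDomain expects 18 nodes.
	basePods := make([]podState, 12)
	for i := range basePods {
		basePods[i] = podState{scheduled: true, ready: false}
	}

	fmt.Println("ready-based predicate lifts gates:", gatesLiftedOnReady(basePods, 12))
	fmt.Println("scheduled-based predicate lifts gates:", gatesLiftedOnScheduled(basePods, 12))
}
```

In this model the ready-based predicate reports `false`, so the 6 scaled pods would stay gated indefinitely, while the scheduled-based predicate reports `true` and lets them be scheduled, which in turn allows the ComputeDomain to reach its expected 18 nodes.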
Which issue(s) this PR fixes:
Fixes #209
Special notes for your reviewer:
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: