
Remove deadlock when deploying PCS with ComputeDomain#215

Merged: unmarshall merged 2 commits into ai-dynamo:main from unmarshall:dra on Oct 8, 2025

@unmarshall unmarshall commented Oct 7, 2025

Collaborator

What type of PR is this?

/kind bug

What this PR does / why we need it:

When using a DRA ComputeDomain on NVL72, a PCS whose PCSG has minAvailable < replicas enters a deadlock.

Consider the following scenario:

  • Create a PCS for disaggregated inference with prefills and decodes modeled as separate PCSGs.
  • Create a ComputeDomain with numNodes = 18.
  • Ensure that minAvailable < replicas for the prefill and/or decode PCSGs.

Let's assume that you have a total of 12 pods (across the prefill and decode PCSGs) that are part of the base pod gang. These pods reference the ComputeDomain. The Grove operator first tries to schedule and start the base pod gang. Only when the base pod gang is ready does it lift the scheduling gates for the pods belonging to the scaled pod gangs.

When the base pod gang pods are started, they do not come up: ComputeDomain.Status.Nodes lists only 12 nodes and ComputeDomain.Status.Status is set to NotReady, because it expects 18 nodes and there are currently only 12. Because of the scheduling gates on the remaining 6 pods (which belong to the scaled pod gangs), kube-scheduler (or an equivalent) never picks these pods for scheduling.

The result is a deadlock: the base pod gang pods never start because the ComputeDomain resource they depend on never becomes Ready, and because the base pod gang pods do not start, the scaled pod gang pods remain schedule gated.
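The circular dependency can be illustrated with a toy model. This is only a sketch under simplifying assumptions (one pod per node, a single base pod gang, a single set of gated pods); the `state` type and all function names are hypothetical and are not taken from the Grove operator or the DRA driver.

```go
package main

import "fmt"

// state is a hypothetical, simplified model of the scenario above.
type state struct {
	numNodes    int  // nodes the ComputeDomain expects (numNodes)
	basePods    int  // pods in the base pod gang
	scaledPods  int  // schedule-gated pods in the scaled pod gangs
	gatesLifted bool // whether the scheduling gates have been removed
}

// scheduledPods counts pods the scheduler is able to place.
// Gated pods are invisible to the scheduler until the gates are lifted.
func (s state) scheduledPods() int {
	n := s.basePods
	if s.gatesLifted {
		n += s.scaledPods
	}
	return n
}

// computeDomainReady assumes one pod per node: the domain becomes
// Ready only once all expected nodes have joined.
func (s state) computeDomainReady() bool {
	return s.scheduledPods() >= s.numNodes
}

// basePodsReady: base pods depend on the ComputeDomain being Ready.
func (s state) basePodsReady() bool {
	return s.computeDomainReady()
}

// readinessRule models the gate-removal rule described above:
// gates are lifted only once the base pod gang is ready.
func readinessRule(s state) bool {
	return s.basePodsReady()
}

// settle applies a gate-removal rule once and returns the new state.
func settle(s state, liftGates func(state) bool) state {
	if !s.gatesLifted && liftGates(s) {
		s.gatesLifted = true
	}
	return s
}

func main() {
	s := settle(state{numNodes: 18, basePods: 12, scaledPods: 6}, readinessRule)
	fmt.Printf("gates lifted: %v, base gang ready: %v\n", s.gatesLifted, s.basePodsReady())
}
```

In this model neither condition can become true first: readiness needs 18 scheduled pods, and lifting the gates needs readiness.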

This PR resolves the deadlock by changing the predicate for removing the scheduling gates from pods in scaled pod gangs.
Instead of waiting for the base pod gang's pods to be ready (num readyPods >= minAvailable), it now only waits for them to be scheduled (num scheduledPods >= minAvailable).
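The difference between the two predicates can be sketched as follows. This is an illustrative sketch, not the code in syncflow.go: the `podCounts` type and both function names are hypothetical, and minAvailable is treated here as a single aggregate threshold for the base pod gang to keep the example small.

```go
package main

import "fmt"

// podCounts is a hypothetical summary of the base pod gang's pods.
type podCounts struct {
	Scheduled int // pods bound to a node
	Ready     int // pods that report Ready
}

// canLiftGatesOnReady models the previous predicate:
// num readyPods >= minAvailable.
func canLiftGatesOnReady(c podCounts, minAvailable int) bool {
	return c.Ready >= minAvailable
}

// canLiftGatesOnScheduled models the predicate after this PR:
// num scheduledPods >= minAvailable.
func canLiftGatesOnScheduled(c podCounts, minAvailable int) bool {
	return c.Scheduled >= minAvailable
}

func main() {
	// All 12 base pods are scheduled, but none is Ready because the
	// ComputeDomain is still waiting for the remaining nodes.
	base := podCounts{Scheduled: 12, Ready: 0}

	fmt.Println("lift on ready:    ", canLiftGatesOnReady(base, 12))
	fmt.Println("lift on scheduled:", canLiftGatesOnScheduled(base, 12))
}
```

With the scheduled-based check, the gated pods become visible to the scheduler while the base pods are still waiting, so the ComputeDomain can eventually see all of its expected nodes.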

Which issue(s) this PR fixes:

Fixes #209

Special notes for your reviewer:

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

  for pods belonging to scaled pod gangs. Previously we used to check
  for ready pods. This is required to remove deadlock when using
  computing domain (DRA).

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
@unmarshall unmarshall changed the title from "* Check scheduled pods against minAvailable for SchedulingGate removal" to "Remove deadlock when deploying PCS with ComputeDomain" on Oct 7, 2025
@unmarshall unmarshall added the kind/bug Categorizes issue or PR as related to a bug. label Oct 7, 2025
renormalize previously approved these changes Oct 7, 2025
Comment thread operator/internal/controller/podclique/components/pod/syncflow.go Outdated
* Regenerated the api docs and crds

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
@julienmancuso

Contributor

I know that it solves an important deadlock issue, but I thought that waiting for the base pod gang to be ready was needed before launching the scaled pod gang.

@sanjaychatterjee

Collaborator

> I know that it solves an important deadlock issue but I thought that waiting for the base pod gang to be ready was needed before launching the scale pod gang.

The original thinking was that the gang-scheduled pods from the base pod gang needed to become ready first, which would have ensured that the scheduled pods avoided any faulty GPUs. However, that reasoning is too conservative, since scheduling on bad GPUs is essentially a race between fault detection and the scheduler binding pods. While that is possible, the chances of it happening are slim, since the scheduler updates its own cache of the healthy resource pool on every scheduling cycle.

So the new reasoning is to delegate any such fault handling to the termination workflow. This PR therefore not only fixes the deadlock issue but also ensures correct behavior when faults occur concurrently with scheduling.

@unmarshall unmarshall merged commit 9640779 into ai-dynamo:main Oct 8, 2025
4 checks passed

Labels

kind/bug Categorizes issue or PR as related to a bug.

Development

Successfully merging this pull request may close these issues.

Grove Schedule Gating Deadlocks with DRA on Blackwell

4 participants