Remove deadlock when deploying PCS with ComputeDomain by unmarshall · Pull Request #215 · ai-dynamo/grove

unmarshall · 2025-10-07T08:50:12Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a DRA ComputeDomain is used on NVL72, deploying a PCS whose PCSGs have minAvailable < replicas results in a deadlock.

Consider the following scenario:

Create a PCS for disaggregated inference, with prefills and decodes modeled as separate PCSGs.
Create a ComputeDomain with numNodes = 18.
Ensure that minAvailable < replicas for the prefill and/or decode PCSGs.

Let's assume that a total of 12 pods (across the prefill and decode PCSGs) are part of the base pod gang. These pods reference the ComputeDomain. The Grove operator first tries to schedule and start the base pod gang. Only when the base pod gang is ready does it lift the scheduling gates for the pods belonging to the scaled pod gangs.

When the base pod gang pods are started, they do not come up, because ComputeDomain.Status.Nodes lists only 12 nodes and ComputeDomain.Status.Status is set to NotReady: the ComputeDomain expects 18 nodes and there are currently only 12. Because of the scheduling gates on the remaining 6 pods (which belong to the scaled pod gangs), kube-scheduler (or an equivalent) never picks these pods for scheduling.

As a result you end up in a deadlock: the base pod gang pods never start because the ComputeDomain resource they depend on never becomes Ready, and since the base pod gang pods do not start, the pods of the scaled pod gangs remain schedule gated.
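The circular dependency described above can be sketched as a tiny fixed-point model. This is only an illustration, not Grove or DRA driver code: the `cluster` type, its fields, and the `step` and `settle` functions are hypothetical, and the model assumes one pod per node so that scheduled pods can be compared directly against numNodes.

```go
package main

import "fmt"

// cluster is a hypothetical, highly simplified model of the scenario above.
// It assumes one pod per node.
type cluster struct {
	numNodes    int  // ComputeDomain numNodes (18 in the scenario)
	basePods    int  // pods in the base pod gang (12 in the scenario)
	scaledPods  int  // pods in the scaled pod gangs (6 in the scenario)
	gatesLifted bool // scheduling gates removed from the scaled pods
	baseReady   bool // base pod gang has become ready
}

// step applies the dependencies once and reports whether any state changed.
func (c *cluster) step() bool {
	scheduled := c.basePods
	if c.gatesLifted {
		scheduled += c.scaledPods
	}
	// The ComputeDomain is Ready only once it sees all expected nodes.
	domainReady := scheduled >= c.numNodes

	changed := false
	// Base pods can only come up when the ComputeDomain is Ready.
	if domainReady && !c.baseReady {
		c.baseReady = true
		changed = true
	}
	// Readiness-based gating: gates are lifted only after the base gang is ready.
	if c.baseReady && !c.gatesLifted {
		c.gatesLifted = true
		changed = true
	}
	return changed
}

// settle repeats step until nothing changes any more.
func settle(c *cluster) {
	for c.step() {
	}
}

func main() {
	c := &cluster{numNodes: 18, basePods: 12, scaledPods: 6}
	settle(c)
	// Neither condition can ever become true: each one waits on the other.
	fmt.Printf("base ready: %v, gates lifted: %v\n", c.baseReady, c.gatesLifted)
}
```

With numNodes = 18 the model settles immediately with nothing ready, which is the deadlock. If numNodes were 12, the base gang alone would satisfy the ComputeDomain and the gates would be lifted as usual.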

This PR resolves the deadlock by changing the predicate used to remove the scheduling gates for pods from scaled pod gangs.
Instead of waiting for the base pod gang pods to be ready (num readyPods >= minAvailable), it now only waits for them to be scheduled (num scheduledPods >= minAvailable).
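The difference between the two predicates can be illustrated with a minimal sketch. It is not the code in syncflow.go: `podState`, `countPods`, `canRemoveGatesOld` and `canRemoveGatesNew` are hypothetical names chosen for this example, and a pod is reduced to two booleans.

```go
package main

import "fmt"

// podState is a hypothetical, simplified view of a pod in the base pod gang.
type podState struct {
	Scheduled bool // the pod has been bound to a node
	Ready     bool // the pod passes its readiness checks
}

// countPods returns how many pods satisfy the given condition.
func countPods(pods []podState, cond func(podState) bool) int {
	n := 0
	for _, p := range pods {
		if cond(p) {
			n++
		}
	}
	return n
}

// canRemoveGatesOld models the previous predicate: readyPods >= minAvailable.
func canRemoveGatesOld(basePods []podState, minAvailable int) bool {
	ready := countPods(basePods, func(p podState) bool { return p.Ready })
	return ready >= minAvailable
}

// canRemoveGatesNew models the predicate described in this PR:
// scheduledPods >= minAvailable.
func canRemoveGatesNew(basePods []podState, minAvailable int) bool {
	scheduled := countPods(basePods, func(p podState) bool { return p.Scheduled })
	return scheduled >= minAvailable
}

func main() {
	// 12 base pods that are scheduled but blocked on the ComputeDomain,
	// so none of them is ready yet.
	basePods := make([]podState, 12)
	for i := range basePods {
		basePods[i] = podState{Scheduled: true, Ready: false}
	}
	fmt.Println("old predicate lifts gates:", canRemoveGatesOld(basePods, 12))
	fmt.Println("new predicate lifts gates:", canRemoveGatesNew(basePods, 12))
}
```

In this sketch the old predicate stays false for as long as the ComputeDomain is NotReady, while the new one becomes true as soon as the base pods are bound to nodes, which lets the remaining 6 pods be scheduled and the ComputeDomain reach its expected node count.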

Which issue(s) this PR fixes:

Fixes #209

Special notes for your reviewer:

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

julienmancuso · 2025-10-07T17:10:21Z

I know that it solves an important deadlock issue but I thought that waiting for the base pod gang to be ready was needed before launching the scale pod gang.

sanjaychatterjee · 2025-10-07T17:24:03Z

> I know that it solves an important deadlock issue but I thought that waiting for the base pod gang to be ready was needed before launching the scale pod gang.

The original thinking was that gang-scheduled pods from the base podgang need to become ready first, which would ensure that the scheduled pods avoided any faulty GPUs. However, that reasoning is too conservative, since scheduling on bad GPUs is essentially a concurrency issue between fault detection and the scheduler binding pods. While it is possible, the chances of it happening are slim, since the scheduler updates its own cache of the healthy pool of resources on every scheduling cycle.

So the new reasoning is to delegate any such fault handling to the termination workflow. This PR therefore not only fixes the deadlock issue but also ensures correct behavior when faults are encountered concurrently with scheduling.

* Check scheduled pods against minAvailable for SchedulingGate removal

1c496ef

for pods belonging to scaled pod gangs. Previously we used to check for ready pods. This is required to remove deadlock when using computing domain (DRA). Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>

unmarshall requested a review from sanjaychatterjee as a code owner October 7, 2025 08:50

unmarshall changed the title from "* Check scheduled pods against minAvailable for SchedulingGate removal" to "Remove deadlock when deploying PCS with ComputeDomain" Oct 7, 2025

unmarshall requested review from nvrohanv and renormalize October 7, 2025 08:52

unmarshall added the kind/bug label Oct 7, 2025

renormalize previously approved these changes Oct 7, 2025

sanjaychatterjee reviewed Oct 7, 2025

Comment thread operator/internal/controller/podclique/components/pod/syncflow.go Outdated

* Update docstring for PodCliqueScalingGroupConfig.MinAvailable

0559530

* Regenerated the api docs and crds Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>

unmarshall dismissed renormalize’s stale review via 0559530 October 7, 2025 16:58

sanjaychatterjee approved these changes Oct 7, 2025

julienmancuso approved these changes Oct 7, 2025

unmarshall merged commit 9640779 into ai-dynamo:main Oct 8, 2025
4 checks passed

unmarshall merged 2 commits into ai-dynamo:main from unmarshall:dra
