Grove Schedule Gating Deadlocks with DRA on Blackwell #209

### What happened?

On Blackwell, a ComputeDomain becomes ready only after every pod that claims it has been scheduled. ComputeDomains are also fixed in size and cannot autoscale, so to launch a 3p1d (three prefill workers, one decode worker) DeepSeek-R1 model with Dynamo on a full NVL72, I have to define the ComputeDomain across all 18 nodes.

However, Grove handles 3p1d differently: it assigns one prefill and one decode pod to the base PodGang, while the remaining two prefill workers are gated until the base PodGang reaches the Ready state. This creates a deadlock. The base PodGang (1p1d) cannot transition to Ready because the ComputeDomain itself is not ready until all pods for the full 3p1d configuration are scheduled. Meanwhile, the additional prefill workers cannot be scheduled until the base PodGang is marked Ready.
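To make the circular dependency explicit, here is a toy wait-for graph with a small cycle check. It is only a sketch of the reasoning above: the node names are made up for illustration and are not Grove or DRA API objects.

```python
# Toy wait-for graph of the situation described above.
# Node names are illustrative assumptions, not Grove or DRA identifiers.
WAITS_FOR = {
    # Base PodGang (1 prefill + 1 decode) cannot be Ready until the domain is ready.
    "base_podgang_ready": ["compute_domain_ready"],
    # The ComputeDomain is not ready until the remaining prefill pods are scheduled.
    "compute_domain_ready": ["scaled_prefill_scheduled"],
    # The remaining prefill pods are gated until the base PodGang is Ready.
    "scaled_prefill_scheduled": ["base_podgang_ready"],
}


def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Found a back edge: slice out the cycle and close it.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


cycle = find_cycle(WAITS_FOR)
print(" -> ".join(cycle) if cycle else "no deadlock")
```

Every condition in the graph waits on another one in the same loop, so none of them can ever become true.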

### What did you expect to happen?

Instead of requiring non-BasePodGang pods to wait until the BasePodGang is ready, the scheduling gate should lift as soon as the BasePodGang is scheduled. The risk is that if the additional prefill workers start too early, they might consume resources needed for the BasePodGang (the MinAvailable set) to succeed. For example, the BasePodGang could be scheduled onto a node that then fails during the lengthy startup process (often 30–60 minutes for large models). If other pods have already occupied the remaining capacity, there may not be enough resources to reschedule the BasePodGang. Even so, this seems like an acceptable tradeoff, since node health and recovery should ideally be managed by Kubernetes.
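As a rough sketch of the difference between the two gate policies, the toy simulation below iterates the state until nothing changes. The `simulate` function, its `gate_lifts_on` parameter, and the state keys are hypothetical names for illustration. The model assumes the base PodGang is already scheduled and ignores the resource-contention risk mentioned above.

```python
def simulate(gate_lifts_on, max_rounds=10):
    """Iterate the toy model to a fixed point and return the final state.

    gate_lifts_on: "ready" (current behavior as described in this issue) or
    "scheduled" (the proposed behavior). Both names are illustrative.
    """
    if gate_lifts_on not in ("ready", "scheduled"):
        raise ValueError("gate_lifts_on must be 'ready' or 'scheduled'")

    state = {
        "base_scheduled": True,     # assumption: base PodGang (1p1d) is already scheduled
        "base_ready": False,
        "scaled_scheduled": False,  # the remaining two prefill workers
        "domain_ready": False,
    }
    for _ in range(max_rounds):
        before = dict(state)
        # The scheduling gate on the extra prefill workers.
        if gate_lifts_on == "ready":
            gate_open = state["base_ready"]
        else:
            gate_open = state["base_scheduled"]
        if gate_open:
            state["scaled_scheduled"] = True
        # The ComputeDomain is ready only once every claiming pod is scheduled.
        state["domain_ready"] = state["base_scheduled"] and state["scaled_scheduled"]
        # The base PodGang can only become Ready once the domain is ready.
        state["base_ready"] = state["base_scheduled"] and state["domain_ready"]
        if state == before:
            break  # fixed point: nothing can make further progress
    return state


for policy in ("ready", "scheduled"):
    final = simulate(policy)
    print(policy, "->", "Ready" if final["base_ready"] else "deadlocked")
```

In this model, gating on Ready never makes progress, while gating on scheduled lets the ComputeDomain and then the base PodGang become ready.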


