Grove Schedule Gating Deadlocks with DRA on Blackwell #209

@nvrohanv

What happened?

On Blackwell, a ComputeDomain becomes ready only after every pod that claims it has been scheduled. Because ComputeDomains are fixed in size and cannot autoscale, if I launch a 3p1d (three prefill workers, one decode worker) DeepSeek-R1 model with Dynamo on a full NVL72, I need to define the ComputeDomain across all 18 nodes up front.

However, Grove does not schedule a 3p1d deployment all at once: it assigns one prefill pod and one decode pod to the base PodGang, and gates the remaining two prefill workers until the base PodGang reaches the Ready state. This creates a deadlock. The base PodGang (1p1d) cannot transition to Ready, because the ComputeDomain is not ready until all pods for the full 3p1d configuration are scheduled. Meanwhile, the additional prefill workers cannot be scheduled until the base PodGang is marked Ready.
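The circular wait described above can be made explicit as a small wait-for graph. This is only an illustrative sketch: the node names (`base_podgang_ready`, `compute_domain_ready`, and so on) are hypothetical labels for the conditions in this issue, not Grove or DRA identifiers, and the graph models nothing beyond the three dependencies stated here.

```python
# Hypothetical wait-for graph: an edge A -> B means "A cannot happen until B has".
# The names are illustrative labels, not identifiers from Grove or the DRA driver.
WAITS_FOR = {
    "base_podgang_ready": ["compute_domain_ready"],
    "compute_domain_ready": ["base_pods_scheduled", "scaled_pods_scheduled"],
    "scaled_pods_scheduled": ["base_podgang_ready"],  # the gate described in this issue
    "base_pods_scheduled": [],
}


def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in graph[node]:
            if color[dep] == GREY:
                # dep is already on the current path, so we have closed a loop.
                return path[path.index(dep):] + [dep]
            if color[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    print(" -> ".join(find_cycle(WAITS_FOR)))
```

Under these assumptions the search reports the loop from base PodGang readiness, through ComputeDomain readiness and the gated prefill pods, back to base PodGang readiness. No ordering of events can satisfy all three conditions.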

What did you expect to happen?

Instead of requiring non-BasePodGang pods to wait until the BasePodGang is ready, the scheduling gate should lift as soon as the BasePodGang is scheduled. The risk is that if the additional prefill workers start too early, they might consume resources needed for the BasePodGang (the MinAvailable set) to succeed. For example, the BasePodGang could be scheduled onto a node that then fails during the lengthy startup process (often 30–60 minutes for large models). If other pods have already occupied the remaining capacity, there may not be enough resources to reschedule the BasePodGang. Even so, this seems like an acceptable tradeoff, since node health and recovery should ideally be managed by Kubernetes.
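To compare the two gate policies, here is a toy state machine run to a fixed point. It is a sketch under simplifying assumptions, not Grove's controller logic: the `simulate` function, the `gate_on` parameter, and the state keys are hypothetical, and the model deliberately ignores capacity, node failures, and startup time, so it says nothing about the rescheduling risk mentioned above.

```python
def simulate(gate_on, max_rounds=10):
    """Advance a toy model until nothing changes and return the final state.

    gate_on: "ready"     -> scaled pods wait for the base PodGang to be Ready
                            (the behaviour reported in this issue)
             "scheduled" -> scaled pods wait only for the base PodGang to be
                            scheduled (the behaviour proposed in this issue)
    """
    if gate_on not in ("ready", "scheduled"):
        raise ValueError(f"unknown gate policy: {gate_on!r}")

    state = {
        "base_scheduled": False,    # 1 prefill + 1 decode pod
        "scaled_scheduled": False,  # the 2 remaining prefill pods
        "domain_ready": False,      # ComputeDomain
        "base_ready": False,        # base PodGang Ready
    }
    for _ in range(max_rounds):
        before = dict(state)

        # Base PodGang pods are not gated, so they can always be scheduled.
        state["base_scheduled"] = True

        # The scheduling gate on the scaled pods lifts according to the policy.
        if gate_on == "ready":
            gate_open = state["base_ready"]
        else:
            gate_open = state["base_scheduled"]
        if gate_open:
            state["scaled_scheduled"] = True

        # The ComputeDomain is ready only once every claiming pod is scheduled.
        if state["base_scheduled"] and state["scaled_scheduled"]:
            state["domain_ready"] = True

        # The base PodGang cannot become Ready before the ComputeDomain is.
        if state["domain_ready"]:
            state["base_ready"] = True

        if state == before:  # fixed point: no further progress is possible
            break
    return state


if __name__ == "__main__":
    for policy in ("ready", "scheduled"):
        print(policy, simulate(policy))
```

With `gate_on="ready"` the model stalls with only the base pods scheduled. With `gate_on="scheduled"` every condition eventually becomes true.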

Labels

kind/bug: Categorizes issue or PR as related to a bug.