Enhancement: MNNVL V2, ComputeDomain Injection Via Annotation #417

@nvrohanv

Summary

Introduce a default, annotation-based mechanism for injecting NVIDIA ComputeDomain membership into pods created from a PodClique. This provides a low-friction and explicit way for users to control MNNVL participation without requiring verbose ComputeDomain specifications, while maintaining clear precedence and error semantics when interacting with existing automatic MNNVL setup mechanisms in Grove.


Background

The first implementation of automatic MNNVL setup in Grove is currently underway and nearing completion.

This implementation:

  • Assumes NVIDIA DRA drivers are installed and correctly configured on all nodes
  • Requires explicit opt-in by the cluster administrator via operator configuration
  • Preserves the familiar “it just works” NVLink experience, where:
    • NVLink automatically exists between GPUs on NVLink-enabled nodes
    • Users do not need to manually define domains or topology for same-node NVLink

This cluster-level automatic behavior is necessary and should remain supported.

However, this model alone is insufficient for a few scenarios:

  • Users may want fine-grained control over MNNVL participation, where only a subset of pods (or PodCliques) within a PodCliqueSet replica should participate in a shared MNNVL domain
  • Grove needs a way to provide some level of automatic ComputeDomain injection in default configurations, without requiring fully automatic cluster-wide MNNVL enablement
  • Heterogeneous clusters with mixed hardware from multiple vendors may not be able to install NVIDIA DRA drivers on all nodes, making cluster-wide automatic MNNVL infeasible while still requiring explicit MNNVL support on NVIDIA-capable subsets

Consequently, Grove needs to provide a simple mechanism to:

  • Explicitly control which pods participate in a given MNNVL domain
  • Avoid authoring verbose, PVC-style ComputeDomain resources
  • Express intent in a way that composes naturally with Grove primitives and heterogeneous cluster environments

Proposal

Add annotation-based ComputeDomain injection at the PodClique level, using a vendor-scoped annotation.

Core Behavior

If a PodClique includes the annotation:

nvidia.com/computedomain: <name>

then pods created from that PodClique will have the following behavior applied:

  • The pods claim a ComputeDomain with the specified name
  • The pods participate in the same MNNVL domain for NVLink / interconnect setup
  • If the ComputeDomain the pods claim does not yet exist, the operator creates it

The exact set of pods affected is defined by the PodClique and PodCliqueSet replica semantics, rather than being implicitly inferred.

This provides a minimal and explicit contract for MNNVL participation without requiring users to define full ComputeDomain specifications, while allowing users to opt into MNNVL on a per-PodClique basis even when broader automation is undesirable or unavailable.
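
The core behavior above can be sketched as a small model. This is only an illustration under stated assumptions, not Grove's implementation: `PodClique`, `Pod`, and `build_pods` are hypothetical, simplified stand-ins, and ComputeDomain creation is modeled as adding a name to a set. Only the annotation key comes from this proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Annotation key taken from this proposal.
COMPUTE_DOMAIN_ANNOTATION = "nvidia.com/computedomain"


@dataclass
class PodClique:
    """Hypothetical, simplified stand-in for a Grove PodClique."""
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)
    replicas: int = 1


@dataclass
class Pod:
    """Hypothetical, simplified pod holding only what this sketch needs."""
    name: str
    compute_domain: Optional[str] = None


def build_pods(clique: PodClique, existing_domains: Set[str]) -> List[Pod]:
    """Create pods for a PodClique, injecting ComputeDomain membership if annotated.

    `existing_domains` models the ComputeDomains already present. A claimed
    domain that is missing is created (added to the set) before pods claim it.
    """
    domain = clique.annotations.get(COMPUTE_DOMAIN_ANNOTATION)
    if domain is not None and domain not in existing_domains:
        existing_domains.add(domain)  # assumption: the operator creates it here
    # Every pod of the clique claims the same domain (or none at all).
    return [
        Pod(name=f"{clique.name}-{i}", compute_domain=domain)
        for i in range(clique.replicas)
    ]
```

In this model, two PodCliques that carry the same annotation value end up with pods in the same domain, while a PodClique without the annotation produces pods with no ComputeDomain claim. That matches the per-PodClique opt-in described above.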


Defaults and Precedence

Default Availability

  • Annotation-based ComputeDomain injection must be enabled by default
  • It must be available regardless of whether cluster-level automatic MNNVL is enabled

This ensures that operators can provide baseline ComputeDomain injection behavior out of the box, while still allowing users to selectively opt into MNNVL only where appropriate.


Interaction with Automatic MNNVL Configuration

Cluster Admin Has Not Opted In to Automatic MNNVL

  • Users may freely use the nvidia.com/computedomain annotation
  • Annotation-based ComputeDomain injection is honored unconditionally

This enables MNNVL usage in clusters where:

  • Automatic MNNVL cannot be enabled globally
  • NVIDIA DRA drivers are present only on a subset of nodes
  • Multiple accelerator vendors coexist within the same cluster

Cluster Admin Has Opted In to Automatic MNNVL

When automatic MNNVL is enabled at the cluster level:

  • Automatic ComputeDomain injection applies by default
  • The user API allows opting out of automatic MNNVL injection at the PodCliqueSet replica level

Valid Usage

  • If a user opts out of automatic MNNVL injection for a given PodCliqueSet replica:
    • The user may specify the nvidia.com/computedomain annotation
    • The annotation must be honored and result in ComputeDomain injection

This allows users to retain fine-grained control over MNNVL participation when the default automatic behavior is too coarse for a given workload.

Invalid Usage (Must Error)

  • If a user does not opt out of automatic MNNVL injection and specifies the nvidia.com/computedomain annotation:
    • This configuration is invalid
    • The system must reject the workload
    • The error must clearly explain that:
      • Automatic MNNVL injection is enabled for this PodCliqueSet replica
      • Explicit ComputeDomain annotations require opting out of automatic injection

This enforces a clear separation between automatic cluster-driven behavior and explicit user-driven intent, and avoids ambiguous or conflicting configuration.
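
The precedence rules in this section reduce to a small decision function. The sketch below is a hedged illustration only: `resolve_injection`, `Injection`, and `InvalidMNNVLConfig` are hypothetical names, and the three inputs are simplified into plain values (in practice the opt-out applies per PodCliqueSet replica and the annotation per PodClique).

```python
from enum import Enum
from typing import Optional


class Injection(Enum):
    """Which mechanism, if any, injects the ComputeDomain (hypothetical enum)."""
    NONE = "none"
    AUTOMATIC = "automatic"
    ANNOTATION = "annotation"


class InvalidMNNVLConfig(ValueError):
    """Raised for the configuration this proposal says must be rejected."""


def resolve_injection(
    auto_mnnvl_enabled: bool,   # cluster admin opted in to automatic MNNVL
    opted_out: bool,            # user opted out for this PodCliqueSet replica
    annotation: Optional[str],  # value of nvidia.com/computedomain, if set
) -> Injection:
    if not auto_mnnvl_enabled or opted_out:
        # No automatic injection is in effect, so the annotation
        # (when present) is honored unconditionally.
        return Injection.ANNOTATION if annotation else Injection.NONE
    if annotation:
        # Automatic injection is active and the user also set the annotation.
        raise InvalidMNNVLConfig(
            "Automatic MNNVL injection is enabled for this PodCliqueSet replica; "
            "explicit ComputeDomain annotations require opting out of "
            "automatic injection."
        )
    return Injection.AUTOMATIC
```

Under these assumptions, `resolve_injection(True, True, "cd-a")` selects annotation-based injection, while `resolve_injection(True, False, "cd-a")` raises an error whose message covers both points the proposal requires.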


Rationale for Proposed Vendor-Specific Annotation Key

The ComputeDomain resource and its semantics are currently NVIDIA-specific, used to support MNNVL / NVLink setup via NVIDIA DRA.

To make ownership explicit and avoid collisions with other accelerator vendors or future abstractions, the annotation used to request ComputeDomain injection must be vendor-scoped:

nvidia.com/computedomain: <name>

Multi-Vendor Extensibility and Future Support

Although this proposal introduces an NVIDIA-scoped annotation and implementation, the design and implementation must not assume that Grove only supports NVIDIA ComputeDomains.

Specifically:

  • The injection mechanism should be structured so that:
    • Vendor-specific ComputeDomain implementations are pluggable
    • Annotation handling can be extended to support equivalent abstractions for other hardware vendors
  • The use of a vendor-scoped annotation allows:
    • MNNVL and NVIDIA DRA support where available
    • Coexistence with other accelerator types in heterogeneous clusters

This proposal explicitly does not preclude Grove from supporting equivalent domain or interconnect abstractions for other accelerators as they become available.
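
One way to keep annotation handling pluggable is a registry keyed by vendor-scoped annotation. The sketch below is an assumption about how that could look, not a description of Grove's design: `register_injector`, `inject`, and the dictionary returned by each injector are all hypothetical.

```python
from typing import Callable, Dict, List

# An injector turns an annotation value into a vendor-specific description
# of what should be applied to the pods (the dict shape is hypothetical).
Injector = Callable[[str], dict]

_INJECTORS: Dict[str, Injector] = {}


def register_injector(annotation_key: str, injector: Injector) -> None:
    """Register a handler for one vendor-scoped annotation key."""
    if annotation_key in _INJECTORS:
        raise ValueError(f"injector already registered for {annotation_key}")
    _INJECTORS[annotation_key] = injector


def nvidia_compute_domain_injector(domain_name: str) -> dict:
    """Handler for the annotation proposed in this issue."""
    return {"vendor": "nvidia", "computeDomain": domain_name}


register_injector("nvidia.com/computedomain", nvidia_compute_domain_injector)


def inject(annotations: Dict[str, str]) -> List[dict]:
    """Apply every registered injector whose key appears in the annotations."""
    return [
        _INJECTORS[key](value)
        for key, value in annotations.items()
        if key in _INJECTORS  # unrelated annotations are ignored
    ]
```

With this shape, supporting another vendor's equivalent abstraction would mean registering one more handler under that vendor's own annotation key, leaving the NVIDIA path untouched.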


Goals

  • Preserve the zero-configuration NVLink experience under automatic MNNVL
  • Provide fine-grained user control over MNNVL participation at the PodClique level
  • Enable partial and selective MNNVL adoption within a PodCliqueSet replica
  • Support heterogeneous clusters where cluster-wide automatic MNNVL is not possible
  • Avoid forcing users into verbose ComputeDomain definitions
  • Make MNNVL participation obvious, intentional, and composable with PodClique semantics
  • Enforce a clear and debuggable intent hierarchy between cluster and user configuration
  • Ensure Grove remains extensible across accelerator vendors

Non-Goals

  • This enhancement does not deprecate:
    • The current automatic MNNVL implementation
    • Fully specified ComputeDomain resources
  • This enhancement does not define scheduling or topology placement policy
  • This enhancement does not lock Grove into supporting only NVIDIA's CR for scale-up high-bandwidth domains
