What you would like to be added?
Automatic MNNVL (Multi-Node NVLink) support needs to be added to Grove.
What is MNNVL?
MNNVL (Multi-Node NVLink) is NVIDIA's technology that enables GPUs in different servers to communicate at full NVLink bandwidth through NVIDIA NVLink Switches, transforming an entire rack into a single, unified GPU fabric. This is particularly important for systems like the NVIDIA GB200 NVL72.
What are ComputeDomains?
A ComputeDomain is an abstraction for robust and secure Multi-Node NVLink. It guarantees MNNVL reachability between the pods in the ComputeDomain and secure isolation from pods outside it. When a workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory securely via NVLink among all pods that comprise the workload.
Why is this needed?
Grove needs to support systems like the NVIDIA GB200 NVL72.