Add automatic support for MNNVL #270

@julienmancuso

What you would like to be added?

Grove should automatically support MNNVL (Multi-Node NVLink).

What is MNNVL?
MNNVL (Multi-Node NVLink) is NVIDIA's technology that enables GPUs in different servers to communicate at full NVLink bandwidth through NVIDIA NVLink Switches, transforming an entire rack into a single, unified GPU fabric. This is particularly important for systems like the NVIDIA GB200 NVL72.

What are ComputeDomains?
A ComputeDomain is an abstraction for robust and secure Multi-Node NVLink. It guarantees MNNVL reachability between pods inside the ComputeDomain and secure isolation from pods outside it. When a workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs does the heavy lifting required to share GPU memory securely over NVLink among all pods that make up the workload.
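
To make the request concrete, here is a rough sketch of the objects Grove would presumably have to generate on the user's behalf. It assumes the ComputeDomain shape used by NVIDIA's DRA Driver for GPUs (`resource.nvidia.com/v1beta1`, with `spec.numNodes` and `spec.channel.resourceClaimTemplate.name`). The API version and field names may differ between driver releases, and the helper names and example values are hypothetical, not part of Grove.

```python
import json


def compute_domain_manifest(name: str, num_nodes: int, claim_template: str) -> dict:
    """Build a ComputeDomain manifest as a plain dict.

    Assumption: apiVersion and field names follow the NVIDIA DRA Driver
    for GPUs and may differ between driver releases.
    """
    return {
        "apiVersion": "resource.nvidia.com/v1beta1",
        "kind": "ComputeDomain",
        "metadata": {"name": name},
        "spec": {
            "numNodes": num_nodes,
            "channel": {"resourceClaimTemplate": {"name": claim_template}},
        },
    }


def pod_resource_claim(claim_name: str, claim_template: str) -> dict:
    """Build the pod-level resourceClaims entry that references the template."""
    return {"name": claim_name, "resourceClaimTemplateName": claim_template}


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    domain = compute_domain_manifest("my-workload-cd", 2, "my-workload-channel")
    claim = pod_resource_claim("compute-domain-channel", "my-workload-channel")
    print(json.dumps(domain, indent=2))
    print(json.dumps(claim, indent=2))
```

With automatic support, a user would not write either object by hand; Grove would create the ComputeDomain and add the claim reference to every pod of the workload.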

Why is this needed?

Grove needs this to support MNNVL systems such as the NVIDIA GB200 NVL72.

Labels: enhancement (New feature or request)
