
Use cdi device driver to handle nvidia --gpus requests#50228

Merged
thaJeztah merged 2 commits into moby:master from elezar:map-nvidia-requests-to-cdi
Jan 20, 2026

Conversation

@elezar
Contributor

@elezar elezar commented Jun 18, 2025

- What I did

Updated the nvidia device driver to handle GPU requests (e.g. the --gpus flag) using the cdi device driver if available, falling back to the existing prestart hook-based approach.

- How I did it

I renamed the existing nvidia device driver to nvidia.runtime-hook and added a new nvidia.cdi device driver; these are registered if the nvidia-container-runtime-hook and nvidia-cdi-hook binaries, respectively, are present on the system. For the nvidia.cdi device driver, the update-spec handler converts the incoming device request to CDI device names using a predefined device kind (e.g. --gpus all maps to nvidia.com/gpu=all) and forwards the request to the cdi handler if defined.

I then construct an nvidia driver with an update spec handler that attempts to apply the update for the available drivers in the order: [nvidia.cdi, nvidia.runtime-hook].

Note: When the changes from #50717 are merged, this would also allow users to explicitly select the nvidia.runtime-hook driver.

- How to verify it

  1. Ensure that the --gpus flag continues to work on systems where the nvidia-container-runtime-hook is not available.

    • Install the nvidia-container-toolkit-base package to install the nvidia-ctk and nvidia-cdi-hook binaries.
    • Ensure that the nvidia-container-runtime-hook and amd-container-runtime are not available on the PATH.
    • Generate a CDI specification for all available nvidia devices by running:
      sudo nvidia-ctk cdi generate --output /var/run/cdi/nvidia.yaml
      
    • Check the environment of the generated container:
      $ docker run --rm -ti --gpus=all ubuntu env | grep NVIDIA
      NVIDIA_VISIBLE_DEVICES=void
      
      (this environment variable is explicitly defined in the CDI spec for the `nvidia.com/gpu=all` device)
    • Run a container requesting a single GPU:
      $ docker run --rm -ti --gpus=1 ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
      
  2. Confirm that CDI injection fails if no CDI spec is found:

    • Remove all nvidia.com/gpu CDI specs:
      $ sudo rm -f /var/run/cdi/nvidia.yaml /etc/cdi/nvidia.yaml
      $ nvidia-ctk cdi list
      INFO[0000] Found 0 CDI devices
      
    • Run a GPU container:
      $ docker run --rm -ti --gpus=1 ubuntu nvidia-smi -L
      docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=0
      
      Run 'docker run --help' for more information
      
  3. Check that we fall back to nvidia-container-runtime-hook-based injection if CDI injection fails as above.

    • Install the nvidia-container-toolkit package to ensure that the nvidia-container-runtime-hook executable is present.
    • Confirm that we still have no CDI specs defined:
      $ nvidia-ctk cdi list
      INFO[0000] Found 0 CDI devices
      
    • Check the environment of a GPU container:
      $ docker run --rm -ti --gpus=all ubuntu env | grep NVIDIA
      NVIDIA_VISIBLE_DEVICES=all
      
    • Run nvidia-smi in a container requesting a single GPU:
      $ docker run --rm -ti --gpus=1 ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
      
  4. Confirm that CDI injection takes precedence over hook-based injection

    • (re)Generate the CDI specs for nvidia devices:
      sudo nvidia-ctk cdi generate --output /var/run/cdi/nvidia.yaml
      
    • Confirm the presence of CDI device specs:
      $ nvidia-ctk cdi list | grep all
      INFO[0000] Found 17 CDI devices
      nvidia.com/gpu=all
      
    • Check the environment of a GPU container:
      $ docker run --rm -ti --gpus=all ubuntu env | grep NVIDIA
      NVIDIA_VISIBLE_DEVICES=void
      
    • Run a GPU container requesting a single GPU:
      $ docker run --rm -ti --gpus=1 ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
      
    • Run a GPU container requesting more GPUs:
      $ docker run --rm -ti --gpus=4 ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
      GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
      GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
      GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
      
    • Run a GPU container requesting a specific GPU by index:
      $ docker run --rm -ti --gpus="device=2" ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
      
    • Run a GPU container requesting a specific GPU by uuid:
      $ docker run --rm -ti --gpus="device=GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54" ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
      

- Human readable description for the release notes

Handle `--gpus` requests for NVIDIA devices using CDI if possible

- A picture of a cute animal (not mandatory but encouraged)

@crazy-max
Member

Related to our discussion in docker/buildx#3320, I wonder if it should be the other way around and use first nvidia-cdi-hook if available, otherwise fallback to container runtime. WDYT?

@elezar
Contributor Author

elezar commented Aug 13, 2025

Related to our discussion in docker/buildx#3320, I wonder if it should be the other way around and use first nvidia-cdi-hook if available, otherwise fallback to container runtime. WDYT?

Could you elaborate on what you mean by use nvidia-cdi-hook? The intent here is to treat the --gpus request the same as an equivalent --device=nvidia.com/gpu= request. The nvidia-cdi-hook is not involved in this process, but is generally referenced in the generated CDI specs.

Update: Sorry, looking at the diff now, I think I understand. As implemented, the presence of the nvidia-container-runtime-hook takes precedence over the nvidia-cdi-hook, which is used as the trigger for mapping to a CDI device request.

Yes, this should change.

@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch 3 times, most recently from 6ea74f7 to 7aeb6d3 Compare August 13, 2025 13:13
}
return incompatibleDeviceRequest{dev.req.Driver, dev.req.Capabilities}
}

Contributor Author

One question would be whether we want to check whether the requested devices are known at this stage and only then invoke the cdiDeviceDriver. Alternatively, we could simplify the fallback logic to always fall back to the setNvidiaGPUs handler for all errors.

Member

This is only for the fallback? I don't have a strong opinion on that; any preference on your side? pros/cons?

Contributor Author

I think we should check for known devices. I don't want to get into a state where a user can't run a container because there is a problem with the generation of the CDI specs -- or they are using a version of the NVIDIA Container Toolkit that doesn't yet generate the CDI specs. Once these are more common, we can consider changing this behaviour, in which case we would probably remove the hook-based injection entirely.
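The "check for known devices" gate can be sketched like this. A plain map stands in for the CDI registry, and `isKnownCDIDevice` is an invented name:

```go
package main

import "fmt"

// isKnownCDIDevice reports whether every requested device name resolves
// against the set of devices advertised by the generated CDI specs. The real
// check would consult the CDI registry/cache; a plain map stands in here.
func isKnownCDIDevice(known map[string]bool, deviceIDs []string) bool {
	for _, id := range deviceIDs {
		if !known[id] {
			return false
		}
	}
	return true
}

func main() {
	// Device names as they would appear in a generated spec such as
	// /var/run/cdi/nvidia.yaml.
	known := map[string]bool{
		"nvidia.com/gpu=all": true,
		"nvidia.com/gpu=0":   true,
	}
	fmt.Println(isKnownCDIDevice(known, []string{"nvidia.com/gpu=all"})) // true
	fmt.Println(isKnownCDIDevice(known, []string{"nvidia.com/gpu=7"}))   // false
}
```

Only when this returns true would the CDI path be taken; otherwise the hook-based injection remains the safety net.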

@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from 7aeb6d3 to 3386805 Compare August 14, 2025 12:05
@ArangoGutierrez

/cc

@elezar elezar marked this pull request as ready for review August 14, 2025 12:45
@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from 3386805 to c28f616 Compare August 14, 2025 12:56
@elezar
Contributor Author

elezar commented Aug 14, 2025

@thaJeztah I think I have spent enough time playing with the first pass of this. One question I still have is what is required to allow users to opt out of this behaviour? I think the detection should be robust enough to handle various combinations, but it may be good to make things more explicit.

@elezar elezar changed the title [WIP] Map nvidia requests to cdi Use cdi device driver to handle nvidia --gpus requests Aug 14, 2025
@thaJeztah thaJeztah added this to the 29.0.0 milestone Aug 14, 2025
@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from c28f616 to da71cc2 Compare August 15, 2025 08:27
@vvoland vvoland modified the milestones: 29.0.0, 29.1.0 Nov 10, 2025
@vvoland vvoland modified the milestones: 29.1.0, 29.2.0 Nov 26, 2025
@vvoland vvoland modified the milestones: 29.2.0, 29.3.0 Dec 15, 2025
@vvoland vvoland added kind/refactor PR's that refactor, or clean-up code area/cdi labels Jan 8, 2026
@thaJeztah
Member

Could you do a rebase to have a fresh run of CI?

@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from fff0e36 to 109cff0 Compare January 9, 2026 11:33
@elezar
Contributor Author

elezar commented Jan 9, 2026

Could you do a rebase to have a fresh run of CI?

done.

@thaJeztah
Member

Probably unrelated; looks like we may have a new flaky test? Added in 10c0fc4 / #51675

=== Failed
=== FAIL: amd64.docker.docker.integration.daemon.nri TestNRIContainerCreateAddMount/mount/bind/ro (1.62s)
time="2026-01-09T11:56:20Z" level=info msg="Created plugin 00-nri-test-plugin (test.main, handles RunPodSandbox,StopPodSandbox,RemovePodSandbox,CreateContainer,PostCreateContainer,StartContainer,PostStartContainer,UpdateContainer,PostUpdateContainer,StopContainer,RemoveContainer)"
time="2026-01-09T11:56:20Z" level=info msg="Registering plugin 00-nri-test-plugin..."
time="2026-01-09T11:56:20Z" level=info msg="Configuring plugin 00-nri-test-plugin for runtime docker/dev..."
time="2026-01-09T11:56:20Z" level=info msg="Connected to docker/dev..." nri-plugin=00-nri-test-plugin
time="2026-01-09T11:56:20Z" level=info msg="Subscribing plugin 00-nri-test-plugin (test.main) for events RunPodSandbox,StopPodSandbox,RemovePodSandbox,CreateContainer,PostCreateContainer,StartContainer,PostStartContainer,UpdateContainer,PostUpdateContainer,StopContainer,RemoveContainer"
time="2026-01-09T11:56:20Z" level=info msg="Started plugin 00-nri-test-plugin..."
time="2026-01-09T11:56:21Z" level=info msg="Synchronized state with the runtime (1 pods, 1 containers)..." nri-plugin=00-nri-test-plugin
    nri_test.go:245: assertion failed: 1 (res.ExitCode int) != 0 (tc.expMountRead int)
    nri_test.go:246: assertion failed: 
        --- ←
        +++ →
        @@ -1 +1,2 @@
        +hello
         
        
time="2026-01-09T11:56:22Z" level=info msg="Stopping plugin 00-nri-test-plugin..."
time="2026-01-09T11:56:22Z" level=info msg="Connection to the runtime lost." nri-plugin=00-nri-test-plugin
    --- FAIL: TestNRIContainerCreateAddMount/mount/bind/ro (1.62s)

=== FAIL: amd64.docker.docker.integration.daemon.nri TestNRIContainerCreateAddMount (6.59s)

Comment on lines +59 to +93
	hasNVIDIAExecutables := make(map[string]struct{})
	for _, e := range []string{nvidiaContainerRuntimeHookExecutableName, nvidiaCDIHookExecutableName} {
		if _, err := exec.LookPath(e); err != nil {
			continue
		}
		hasNVIDIAExecutables[e] = struct{}{}
	}

	if len(hasNVIDIAExecutables) == 0 {
		return nil
	}

	nvidiaDrivers := make(map[string]*deviceDriver)
	// Register a driver specific to the nvidia-container-runtime-hook if present.
	// This has no capabilities associated with it so as to not inadvertently
	// match requests.
	if _, ok := hasNVIDIAExecutables[nvidiaContainerRuntimeHookExecutableName]; ok {
		nvidiaDrivers["nvidia.runtime-hook"] = &deviceDriver{
			capset:     nil,
			updateSpec: setNvidiaGPUs,
		}
	}

	// Register a driver specific to CDI if present.
	// This has no capabilities associated with it so as to not inadvertently
	// match requests.
	if _, ok := hasNVIDIAExecutables[nvidiaCDIHookExecutableName]; ok {
		nvidiaDrivers["nvidia.cdi"] = &deviceDriver{
			capset: nil,
			updateSpec: (&cdiDeviceInjector{
				defaultCDIDeviceKind: "nvidia.com/gpu",
			}).injectDevices,
		}
	}

Member

Minor nit; I guess we can skip the hasNVIDIAExecutables intermediate map here, and directly use the nvidiaDrivers map?

	nvidiaDrivers := make(map[string]*deviceDriver)
	if _, err := exec.LookPath(nvidiaCDIHookExecutableName); err == nil {
		// Register a driver specific to CDI if present.
		// This has no capabilities associated to not inadvertently match requests.
		nvidiaDrivers["nvidia.cdi"] = &deviceDriver{
			updateSpec: (&cdiDeviceInjector{
				defaultCDIDeviceKind: "nvidia.com/gpu",
			}).injectDevices,
		}
	}

	if _, err := exec.LookPath(nvidiaContainerRuntimeHookExecutableName); err == nil {
		// Register a driver specific to the nvidia-container-runtime-hook if present.
		// This has no capabilities associated to not inadvertently match requests.
		nvidiaDrivers["nvidia.runtime-hook"] = &deviceDriver{
			updateSpec: setNvidiaGPUs,
		}
	}

	if len(nvidiaDrivers) == 0 {
		return nil
	}

Perhaps even the composite could be rolled up into the above, but it might be clearer to construct that below to express the "order is important here".

Contributor Author

Thanks for the proposal. I reworked this a bit and I believe it's cleaner now.

Comment on lines +240 to +244
	// If the cdi device driver is not available then we return an error.
	cdiDeviceDriver := deviceDrivers["cdi"]
	if cdiDeviceDriver == nil {
		return fmt.Errorf("no CDI device driver registered: %w", incompatibleDeviceRequest{dev.req.Driver, dev.req.Capabilities})
	}
Member

Looks like this check is more lightweight than the code above; perhaps move this to the start of the method to allow an early return?

Contributor Author

I suppose the question is whether we want to raise an error if no CDI driver is registered for empty requests. I think we can meet in the middle, where we quick-return if the requested IDs are nil and then only construct cdiDeviceIDs if the driver is registered.
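The "meet in the middle" behaviour — quick-return for empty requests, and require the cdi driver only when there are device IDs to translate — can be sketched as follows. `forwardToCDI` is an invented name, and the real handler operates on an OCI spec rather than printing:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoCDIDriver = errors.New("no CDI device driver registered")

// forwardToCDI quick-returns for empty requests, so that the absence of a
// CDI driver is only an error when there is actually something to inject.
func forwardToCDI(haveCDIDriver bool, deviceIDs []string) error {
	if len(deviceIDs) == 0 {
		return nil // nothing to inject; no error even without a CDI driver
	}
	if !haveCDIDriver {
		return errNoCDIDriver
	}
	fmt.Println("forwarding to cdi driver:", deviceIDs)
	return nil
}

func main() {
	fmt.Println(forwardToCDI(false, nil))                          // <nil>
	fmt.Println(forwardToCDI(false, []string{"nvidia.com/gpu=0"})) // no CDI device driver registered
}
```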

Comment on lines +248 to +258
	cdiRequest := dev.req
	cdiRequest.Count = 0
	cdiRequest.Options = nil
	cdiRequest.DeviceIDs = cdiDeviceIDs

	cdiDeviceInstance := deviceInstance{
		req:          cdiRequest,
		selectedCaps: nil,
	}

	return cdiDeviceDriver.updateSpec(s, &cdiDeviceInstance)
Member

Looking at the container.DeviceRequest struct, I think the only thing we're preserving from the dev.req is the Driver field and the Capabilities; Wondering if it'd be more clear to construct a fresh request here (but not sure if that's correct; just from glancing over);

	return cdiDeviceDriver.updateSpec(s, &deviceInstance{
		req: container.DeviceRequest{
			Driver:       dev.req.Driver,
			DeviceIDs:    cdiDeviceIDs,
			Capabilities: dev.req.Capabilities,
		},
		selectedCaps: nil,
	})

I don't think there's a need to de-reference the Capabilities? If there is, then perhaps a slices.Clone (but not sure that really helps as it's a slice of slices);

	return cdiDeviceDriver.updateSpec(s, &deviceInstance{
		req: container.DeviceRequest{
			Driver:       dev.req.Driver,
			DeviceIDs:    cdiDeviceIDs,
			Capabilities: slices.Clone(dev.req.Capabilities),
		},
		selectedCaps: nil,
	})

Contributor Author

Updated.

// normalizeDeviceID ensures that the specified deviceID is a fully-qualified
// CDI device name.
// If the deviceID is already a fully-qualified CDI device name it is returned
// as-is, otherwise, the defailt CDI device kind (vendor/class) is used to
Member

Typo; defailt -> default

Contributor Author

Fixed.

Comment on lines +268 to +269
	parts := strings.SplitN(deviceID, "=", 2)
	if len(parts) == 2 {
Member

Perhaps strings.Cut so that we don't allocate a new slice just for checking;

	_, _, ok := strings.Cut(deviceID, "=")
	if ok {
		return deviceID

Contributor Author

Updated.
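For reference, a self-contained sketch of the normalization using strings.Cut as suggested. The default device kind matches the PR, but this is an illustration rather than the merged code:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDeviceID ensures that a device ID is a fully-qualified CDI device
// name. IDs that already contain "=" (vendor/class=name) are returned as-is;
// bare IDs such as "all" or "0" are prefixed with the default device kind.
func normalizeDeviceID(defaultKind, deviceID string) string {
	if _, _, ok := strings.Cut(deviceID, "="); ok {
		return deviceID
	}
	return defaultKind + "=" + deviceID
}

func main() {
	for _, id := range []string{"all", "0", "nvidia.com/gpu=1"} {
		fmt.Println(normalizeDeviceID("nvidia.com/gpu", id))
	}
	// prints:
	// nvidia.com/gpu=all
	// nvidia.com/gpu=0
	// nvidia.com/gpu=1
}
```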

Comment on lines -32 to -33
// NvidiaCLI is the path to the Nvidia helper binary
const NvidiaCLI = "nvidia-container-cli"
Member

Interesting; looks like this was using a different binary than we have in the Moby code; is that correct?

Contributor Author

Yes, the ctr implementation was invoking the nvidia-container-cli directly in a hook instead of using the nvidia-container-runtime-hook. See https://github.com/containerd/containerd/pull/12537/changes#diff-59afc480d0fe54b9edf0bca811470767f059431f5e65cd89c297ffcdbb27ef62L90 where I also removed this from the ctr code.

This inconsistency is one of the motivations for switching to CDI -- which should be consistent across all runtimes.


@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch 2 times, most recently from 4c30d65 to daa1213 Compare January 16, 2026 18:36
@thaJeztah thaJeztah modified the milestones: 29.3.0, 29.2.0 Jan 19, 2026
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from daa1213 to f918651 Compare January 19, 2026 13:10
Contributor

@vvoland vvoland left a comment

LGTM

@elezar
Contributor Author

elezar commented Jan 19, 2026

LGTM

thanks for the review @vvoland. Let me know if I need to rebase again to apply the fixup commit.

@vvoland vvoland requested a review from thaJeztah January 19, 2026 16:09
This change updates the device driver for --gpus requests to
handle them as CDI device requests. This is only done if the
nvidia-cdi-hook binary is present, falling back to the
nvidia-container-runtime-hook if available.

If a cdi deviceDriver is installed at the point where a request should
be handled, the incoming device request is updated to refer to
fully-qualified CDI device names and forwarded to the cdi
deviceDriver.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Member

@thaJeztah thaJeztah left a comment

LGTM

I'll do a quick squash of the "fixup" commit

@thaJeztah thaJeztah force-pushed the map-nvidia-requests-to-cdi branch from f918651 to 86f122a Compare January 20, 2026 13:36
@thaJeztah thaJeztah merged commit 5e01275 into moby:master Jan 20, 2026
54 checks passed
@elezar elezar deleted the map-nvidia-requests-to-cdi branch January 20, 2026 14:31
@vvoland vvoland removed the backlog label Jan 21, 2026
@vvoland vvoland moved this from New to Complete in 🔦 Maintainer spotlight Jan 21, 2026