
Use cdi device driver to handle nvidia --gpus requests#50228

Merged
thaJeztah merged 2 commits into moby:master from elezar:map-nvidia-requests-to-cdi
Jan 20, 2026

Conversation

@elezar
Contributor

@elezar elezar commented Jun 18, 2025

- What I did

Updated the nvidia device driver to handle GPU requests (e.g. the --gpus flag) using the cdi device driver if available, falling back to the existing prestart hook-based approach.

- How I did it

I renamed the existing nvidia device driver to nvidia.runtime-hook and added a new nvidia.cdi device driver; these are registered if the nvidia-container-runtime-hook and nvidia-cdi-hook binaries, respectively, are present on the system. For the nvidia.cdi device driver, the update-spec handler converts the incoming device request to CDI device names using a predefined device kind (e.g. --gpus all maps to nvidia.com/gpu=all) and forwards the request to the cdi handler if defined.

I then construct an nvidia driver with an update spec handler that attempts to apply the update for the available drivers in the order: [nvidia.cdi, nvidia.runtime-hook].

Note: When the changes from #50717 are merged, this would also allow users to explicitly select the nvidia.runtime-hook driver.

- How to verify it

  1. Ensure that the --gpus flag continues to work on systems where the nvidia-container-runtime-hook is not available.

    • Install the nvidia-container-toolkit-base package to install the nvidia-ctk and nvidia-cdi-hook binaries.
    • Ensure that the nvidia-container-runtime-hook and amd-container-runtime are not available on the PATH.
    • Generate a CDI specification for all available nvidia devices by running:
      sudo nvidia-ctk cdi generate --output /var/run/cdi/nvidia.yaml
      
    • Check the environment of the generated container:
      $ docker run --rm -ti --gpus=all ubuntu env | grep NVIDIA
      NVIDIA_VISIBLE_DEVICES=void
      
      (this environment variable is explicitly defined in the CDI spec for the `nvidia.com/gpu=all` device)
    • Run a container requesting a single GPU:
      $ docker run --rm -ti --gpus=1 ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
      
  2. Confirm that CDI injection fails if no CDI spec is found:

    • Remove all nvidia.com/gpu CDI specs:
      $ sudo rm -f /var/run/cdi/nvidia.yaml /etc/cdi/nvidia.yaml
      $ nvidia-ctk cdi list
      INFO[0000] Found 0 CDI devices
      
    • Run a GPU container:
      $ docker run --rm -ti --gpus=1 ubuntu nvidia-smi -L
      docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=0
      
      Run 'docker run --help' for more information
      
  3. Check that we fall back to nvidia-container-runtime-hook-based injection if CDI injection fails as above.

    • Install the nvidia-container-toolkit package to ensure that the nvidia-container-runtime-hook executable is present.
    • Confirm that we still have no CDI specs defined:
      $ nvidia-ctk cdi list
      INFO[0000] Found 0 CDI devices
      
    • Check the environment of a GPU container:
      $ docker run --rm -ti --gpus=all ubuntu env | grep NVIDIA
      NVIDIA_VISIBLE_DEVICES=all
      
    • Run nvidia-smi in a container requesting a single GPU:
      $ docker run --rm -ti --gpus=1 ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
      
  4. Confirm that CDI injection takes precedence over hook-based injection

    • (re)Generate the CDI specs for nvidia devices:
      sudo nvidia-ctk cdi generate --output /var/run/cdi/nvidia.yaml
      
    • Confirm the presence of CDI device specs:
      $ nvidia-ctk cdi list | grep all
      INFO[0000] Found 17 CDI devices
      nvidia.com/gpu=all
      
    • Check the environment of a GPU container:
      $ docker run --rm -ti --gpus=all ubuntu env | grep NVIDIA
      NVIDIA_VISIBLE_DEVICES=void
      
    • Run a GPU container requesting a single GPU:
      $ docker run --rm -ti --gpus=1 ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
      
    • Run a GPU container requesting more GPUs:
      $ docker run --rm -ti --gpus=4 ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
      GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
      GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
      GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
      
    • Run a GPU container requesting a specific GPU by index:
      $ docker run --rm -ti --gpus="device=2" ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
      
    • Run a GPU container requesting a specific GPU by uuid:
      $ docker run --rm -ti --gpus="device=GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54" ubuntu nvidia-smi -L
      GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
      

- Human readable description for the release notes

Handle `--gpus` requests for NVIDIA devices using CDI if possible

- A picture of a cute animal (not mandatory but encouraged)

@crazy-max
Member

Related to our discussion in docker/buildx#3320, I wonder if it should be the other way around and use first nvidia-cdi-hook if available, otherwise fallback to container runtime. WDYT?

@elezar
Contributor Author

elezar commented Aug 13, 2025

Related to our discussion in docker/buildx#3320, I wonder if it should be the other way around and use first nvidia-cdi-hook if available, otherwise fallback to container runtime. WDYT?

Could you elaborate on what you mean by use nvidia-cdi-hook? The intent here is to treat the --gpus request the same as an equivalent --device=nvidia.com/gpu= request. The nvidia-cdi-hook is not involved in this process, but is generally referenced in the generated CDI specs.

Update: Sorry, looking at the diff now, I think I understand. As implemented, the presence of the nvidia-container-runtime-hook takes precedence over the nvidia-cdi-hook, which is used as the trigger for mapping to a CDI device request.

Yes, this should change.

@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch 3 times, most recently from 6ea74f7 to 7aeb6d3 Compare August 13, 2025 13:13
}
return incompatibleDeviceRequest{dev.req.Driver, dev.req.Capabilities}
}

Contributor Author

One question would be whether we want to check whether the requested devices are known at this stage and only then invoke the cdiDeviceDriver. Alternatively, we could simplify the fallback logic to always fall back to the setNvidiaGPUs handler for all errors.

Member

This is only for the fallback? I don't have a strong opinion on that; any preference on your side? pros/cons?

Contributor Author

I think we should check for known devices. I don't want to get into a state where a user can't run a container because there is a problem with the generation of the CDI specs -- or they are using a version of the NVIDIA Container Toolkit that doesn't yet generate the CDI specs. Once these are more common, we can consider changing this behaviour, in which case we would probably remove the hook-based injection entirely.
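The "check for known devices" gate can be sketched like this. A plain map stands in for the CDI registry, and `isKnownCDIDevice` is an invented name:

```go
package main

import "fmt"

// isKnownCDIDevice reports whether every requested device name resolves
// against the set of devices advertised by the generated CDI specs. The real
// check would consult the CDI registry/cache; a plain map stands in here.
func isKnownCDIDevice(known map[string]bool, deviceIDs []string) bool {
	for _, id := range deviceIDs {
		if !known[id] {
			return false
		}
	}
	return true
}

func main() {
	// Device names as they would appear in a generated spec such as
	// /var/run/cdi/nvidia.yaml.
	known := map[string]bool{
		"nvidia.com/gpu=all": true,
		"nvidia.com/gpu=0":   true,
	}
	fmt.Println(isKnownCDIDevice(known, []string{"nvidia.com/gpu=all"})) // true
	fmt.Println(isKnownCDIDevice(known, []string{"nvidia.com/gpu=7"}))   // false
}
```

Only when this returns true would the CDI path be taken; otherwise the hook-based injection remains the safety net.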

@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from 7aeb6d3 to 3386805 Compare August 14, 2025 12:05
@ArangoGutierrez

/cc

@elezar elezar marked this pull request as ready for review August 14, 2025 12:45
@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from 3386805 to c28f616 Compare August 14, 2025 12:56
@elezar
Contributor Author

elezar commented Aug 14, 2025

@thaJeztah I think I have spent enough time playing with the first pass of this. One question I still have is what is required to allow users to opt out of this behaviour? I think the detection should be robust enough to handle various combinations, but it may be good to make things more explicit.

@elezar elezar changed the title [WIP] Map nvidia requests to cdi Use cdi device driver to handle nvidia --gpus requests Aug 14, 2025
@thaJeztah thaJeztah added this to the 29.0.0 milestone Aug 14, 2025
@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from c28f616 to da71cc2 Compare August 15, 2025 08:27
@vvoland vvoland modified the milestones: 29.0.0, 29.1.0 Nov 10, 2025
@vvoland vvoland modified the milestones: 29.1.0, 29.2.0 Nov 26, 2025
@vvoland vvoland modified the milestones: 29.2.0, 29.3.0 Dec 15, 2025
@vvoland vvoland added kind/refactor PR's that refactor, or clean-up code area/cdi labels Jan 8, 2026
@thaJeztah
Member

Could you do a rebase to have a fresh run of CI?

@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from fff0e36 to 109cff0 Compare January 9, 2026 11:33
@elezar
Contributor Author

elezar commented Jan 9, 2026

Could you do a rebase to have a fresh run of CI?

done.

@thaJeztah
Member

Probably unrelated; looks like we may have a new flaky test? Added in 10c0fc4 / #51675

=== Failed
=== FAIL: amd64.docker.docker.integration.daemon.nri TestNRIContainerCreateAddMount/mount/bind/ro (1.62s)
time="2026-01-09T11:56:20Z" level=info msg="Created plugin 00-nri-test-plugin (test.main, handles RunPodSandbox,StopPodSandbox,RemovePodSandbox,CreateContainer,PostCreateContainer,StartContainer,PostStartContainer,UpdateContainer,PostUpdateContainer,StopContainer,RemoveContainer)"
time="2026-01-09T11:56:20Z" level=info msg="Registering plugin 00-nri-test-plugin..."
time="2026-01-09T11:56:20Z" level=info msg="Configuring plugin 00-nri-test-plugin for runtime docker/dev..."
time="2026-01-09T11:56:20Z" level=info msg="Connected to docker/dev..." nri-plugin=00-nri-test-plugin
time="2026-01-09T11:56:20Z" level=info msg="Subscribing plugin 00-nri-test-plugin (test.main) for events RunPodSandbox,StopPodSandbox,RemovePodSandbox,CreateContainer,PostCreateContainer,StartContainer,PostStartContainer,UpdateContainer,PostUpdateContainer,StopContainer,RemoveContainer"
time="2026-01-09T11:56:20Z" level=info msg="Started plugin 00-nri-test-plugin..."
time="2026-01-09T11:56:21Z" level=info msg="Synchronized state with the runtime (1 pods, 1 containers)..." nri-plugin=00-nri-test-plugin
    nri_test.go:245: assertion failed: 1 (res.ExitCode int) != 0 (tc.expMountRead int)
    nri_test.go:246: assertion failed: 
        --- ←
        +++ →
        @@ -1 +1,2 @@
        +hello
         
        
time="2026-01-09T11:56:22Z" level=info msg="Stopping plugin 00-nri-test-plugin..."
time="2026-01-09T11:56:22Z" level=info msg="Connection to the runtime lost." nri-plugin=00-nri-test-plugin
    --- FAIL: TestNRIContainerCreateAddMount/mount/bind/ro (1.62s)

=== FAIL: amd64.docker.docker.integration.daemon.nri TestNRIContainerCreateAddMount (6.59s)

Comment on lines +59 to +93
	hasNVIDIAExecutables := make(map[string]struct{})
	for _, e := range []string{nvidiaContainerRuntimeHookExecutableName, nvidiaCDIHookExecutableName} {
		if _, err := exec.LookPath(e); err != nil {
			continue
		}
		hasNVIDIAExecutables[e] = struct{}{}
	}

	if len(hasNVIDIAExecutables) == 0 {
		return nil
	}

	nvidiaDrivers := make(map[string]*deviceDriver)
	// Register a driver specific to the nvidia-container-runtime-hook if present.
	// This has no capabilities associated with it so as to not inadvertently
	// match requests.
	if _, ok := hasNVIDIAExecutables[nvidiaContainerRuntimeHookExecutableName]; ok {
		nvidiaDrivers["nvidia.runtime-hook"] = &deviceDriver{
			capset:     nil,
			updateSpec: setNvidiaGPUs,
		}
	}

	// Register a driver specific to CDI if present.
	// This has no capabilities associated with it so as to not inadvertently
	// match requests.
	if _, ok := hasNVIDIAExecutables[nvidiaCDIHookExecutableName]; ok {
		nvidiaDrivers["nvidia.cdi"] = &deviceDriver{
			capset: nil,
			updateSpec: (&cdiDeviceInjector{
				defaultCDIDeviceKind: "nvidia.com/gpu",
			}).injectDevices,
		}
	}

Member

Minor nit; I guess we can skip the hasNVIDIAExecutables intermediate map here, and directly use the nvidiaDrivers map?

	nvidiaDrivers := make(map[string]*deviceDriver)
	if _, err := exec.LookPath(nvidiaCDIHookExecutableName); err == nil {
		// Register a driver specific to CDI if present.
		// This has no capabilities associated to not inadvertently match requests.
		nvidiaDrivers["nvidia.cdi"] = &deviceDriver{
			updateSpec: (&cdiDeviceInjector{
				defaultCDIDeviceKind: "nvidia.com/gpu",
			}).injectDevices,
		}
	}

	if _, err := exec.LookPath(nvidiaContainerRuntimeHookExecutableName); err == nil {
		// Register a driver specific to the nvidia-container-runtime-hook if present.
		// This has no capabilities associated to not inadvertently match requests.
		nvidiaDrivers["nvidia.runtime-hook"] = &deviceDriver{
			updateSpec: setNvidiaGPUs,
		}
	}

	if len(nvidiaDrivers) == 0 {
		return nil
	}

Perhaps even the composite could be rolled up into the above, but it might be clearer to construct that below to express the "order is important here".

Contributor Author

Thanks for the proposal. I reworked this a bit and I believe it's cleaner now.

Comment on lines +240 to +244
	// If the cdi device driver is not available then we return an error.
	cdiDeviceDriver := deviceDrivers["cdi"]
	if cdiDeviceDriver == nil {
		return fmt.Errorf("no CDI device driver registered: %w", incompatibleDeviceRequest{dev.req.Driver, dev.req.Capabilities})
	}
Member

Looks like this check is more lightweight than the code above; perhaps move this to the start of the method to allow an early return?

Contributor Author

I suppose the question is whether we want to raise an error if no CDI driver is registered for empty requests. I think we can meet in the middle, where we quick-return if the requested IDs are nil and then only construct cdiDeviceIDs if the driver is registered.
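The "meet in the middle" behaviour — quick-return for empty requests, and require the cdi driver only when there are device IDs to translate — can be sketched as follows. `forwardToCDI` is an invented name, and the real handler operates on an OCI spec rather than printing:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoCDIDriver = errors.New("no CDI device driver registered")

// forwardToCDI quick-returns for empty requests, so that the absence of a
// CDI driver is only an error when there is actually something to inject.
func forwardToCDI(haveCDIDriver bool, deviceIDs []string) error {
	if len(deviceIDs) == 0 {
		return nil // nothing to inject; no error even without a CDI driver
	}
	if !haveCDIDriver {
		return errNoCDIDriver
	}
	fmt.Println("forwarding to cdi driver:", deviceIDs)
	return nil
}

func main() {
	fmt.Println(forwardToCDI(false, nil))                          // <nil>
	fmt.Println(forwardToCDI(false, []string{"nvidia.com/gpu=0"})) // no CDI device driver registered
}
```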

Comment on lines +248 to +258
	cdiRequest := dev.req
	cdiRequest.Count = 0
	cdiRequest.Options = nil
	cdiRequest.DeviceIDs = cdiDeviceIDs

	cdiDeviceInstance := deviceInstance{
		req:          cdiRequest,
		selectedCaps: nil,
	}

	return cdiDeviceDriver.updateSpec(s, &cdiDeviceInstance)
Member

Looking at the container.DeviceRequest struct, I think the only thing we're preserving from the dev.req is the Driver field and the Capabilities; Wondering if it'd be more clear to construct a fresh request here (but not sure if that's correct; just from glancing over);

	return cdiDeviceDriver.updateSpec(s, &deviceInstance{
		req: container.DeviceRequest{
			Driver:       dev.req.Driver,
			DeviceIDs:    cdiDeviceIDs,
			Capabilities: dev.req.Capabilities,
		},
		selectedCaps: nil,
	})

I don't think there's a need to de-reference the Capabilities? If there is, then perhaps a slices.Clone (but not sure that really helps as it's a slice of slices);

	return cdiDeviceDriver.updateSpec(s, &deviceInstance{
		req: container.DeviceRequest{
			Driver:       dev.req.Driver,
			DeviceIDs:    cdiDeviceIDs,
			Capabilities: slices.Clone(dev.req.Capabilities),
		},
		selectedCaps: nil,
	})

Contributor Author

Updated.

// normalizeDeviceID ensures that the specified deviceID is a fully-qualified
// CDI device name.
// If the deviceID is already a fully-qualified CDI device name it is returned
// as-is, otherwise, the defailt CDI device kind (vendor/class) is used to
Member

Typo; defailt -> default

Contributor Author

Fixed.

Comment on lines +268 to +269
	parts := strings.SplitN(deviceID, "=", 2)
	if len(parts) == 2 {
Member

Perhaps strings.Cut so that we don't allocate a new slice just for checking;

	_, _, ok := strings.Cut(deviceID, "=")
	if ok {
		return deviceID

Contributor Author

Updated.
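For reference, a self-contained sketch of the normalization using strings.Cut as suggested. The default device kind matches the PR, but this is an illustration rather than the merged code:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDeviceID ensures that a device ID is a fully-qualified CDI device
// name. IDs that already contain "=" (vendor/class=name) are returned as-is;
// bare IDs such as "all" or "0" are prefixed with the default device kind.
func normalizeDeviceID(defaultKind, deviceID string) string {
	if _, _, ok := strings.Cut(deviceID, "="); ok {
		return deviceID
	}
	return defaultKind + "=" + deviceID
}

func main() {
	for _, id := range []string{"all", "0", "nvidia.com/gpu=1"} {
		fmt.Println(normalizeDeviceID("nvidia.com/gpu", id))
	}
	// prints:
	// nvidia.com/gpu=all
	// nvidia.com/gpu=0
	// nvidia.com/gpu=1
}
```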

Comment on lines -32 to -33
// NvidiaCLI is the path to the Nvidia helper binary
const NvidiaCLI = "nvidia-container-cli"
Member

Interesting; looks like this was using a different binary than we have in the Moby code; is that correct?

Contributor Author

Yes, the ctr implementation was invoking the nvidia-container-cli directly in a hook instead of using the nvidia-container-runtime-hook. See https://github.com/containerd/containerd/pull/12537/changes#diff-59afc480d0fe54b9edf0bca811470767f059431f5e65cd89c297ffcdbb27ef62L90 where I also removed this from the ctr code.

This inconsistency is one of the motivations for switching to CDI -- which should be consistent across all runtimes.


@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch 2 times, most recently from 4c30d65 to daa1213 Compare January 16, 2026 18:36
@thaJeztah thaJeztah modified the milestones: 29.3.0, 29.2.0 Jan 19, 2026
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the map-nvidia-requests-to-cdi branch from daa1213 to f918651 Compare January 19, 2026 13:10
Contributor

@vvoland vvoland left a comment

LGTM

@elezar
Contributor Author

elezar commented Jan 19, 2026

LGTM

thanks for the review @vvoland. Let me know if I need to rebase again to apply the fixup commit.

@vvoland vvoland requested a review from thaJeztah January 19, 2026 16:09
This change updates the device driver for --gpus requests to
handle them as CDI device requests. This is only done if the
nvidia-cdi-hook binary is present, falling back to the
nvidia-container-runtime-hook if available.

If a cdi deviceDriver is installed at the point where a request should
be handled, the incoming device request is updated to refer to
fully-qualified CDI device names and forwarded to the cdi
deviceDriver.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Member

@thaJeztah thaJeztah left a comment

LGTM

I'll do a quick squash of the "fixup" commit

@thaJeztah thaJeztah force-pushed the map-nvidia-requests-to-cdi branch from f918651 to 86f122a Compare January 20, 2026 13:36
@thaJeztah thaJeztah merged commit 5e01275 into moby:master Jan 20, 2026
54 checks passed
@elezar elezar deleted the map-nvidia-requests-to-cdi branch January 20, 2026 14:31
@vvoland vvoland removed the backlog label Jan 21, 2026
@vvoland vvoland moved this from New to Complete in 🔦 Maintainer spotlight Jan 21, 2026