Remove support for specifying nvidia driver capabilities in --gpus flag #50099

Draft
elezar wants to merge 2 commits into moby:master from elezar:remove-nvidia-gpu-capabilities-from-gpus-flag

@elezar
Contributor

@elezar elezar commented May 28, 2025

- What I did

Updated the code that handles the injection of the nvidia-container-runtime-hook so that it no longer handles NVIDIA-specific capabilities and instead relies on the NVIDIA_DRIVER_CAPABILITIES environment variable.

The motivation for this is:

  1. The list of capabilities is incomplete and has not been updated for new capabilities.
  2. Handling of capabilities complicates the migration to CDI.
  3. The handling of capabilities on the --gpus flag conflicts with the NVIDIA_DRIVER_CAPABILITIES envvar that is specified in the image or on the command line.

See also:

- How I did it

Removed references to and processing of NVIDIA-specific capabilities.

- How to verify it

Running:

docker run --rm -ti --gpus 'all,capabilities=utility' -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu env | grep NVIDIA_DRIVER_CAPABILITIES

should show:

NVIDIA_DRIVER_CAPABILITIES=all

- Human readable description for the release notes

Removed support for specifying NVIDIA-specific capabilities. This affects the `--gpus` flag and the Docker Compose specification.

- A picture of a cute animal (not mandatory but encouraged)

@elezar
Contributor Author

elezar commented May 28, 2025

cc @thaJeztah

@elezar
Contributor Author

elezar commented May 28, 2025

We would also need to remove the reference here: https://docs.docker.com/engine/containers/resource_constraints/#set-nvidia-capabilities

@thaJeztah
Member

Curious; would these options be something we should produce an error for when set, or (I guess they're a bit of a niche case) is it safe to silently ignore them if they're used? Alternatively, something in the middle would be to return a warning;

var warnings []string
if warn := handleVolumeDriverBC(version, hostConfig); warn != "" {
warnings = append(warnings, warn)
}

ccr.Warnings = append(ccr.Warnings, warnings...)
return httputils.WriteJSON(w, http.StatusCreated, ccr)

We would also need to remove the reference here: https://docs.docker.com/engine/containers/resource_constraints/#set-nvidia-capabilities

Yes, also the compose docs, looks like;
https://docs.docker.com/compose/how-tos/gpu-support/#example-of-a-compose-file-for-running-a-service-with-access-to-1-gpu-device

IIUC, those examples could be updated by setting the env-var manually instead, correct? If that's the case, and if that already works, perhaps it would be good to already update the examples to match that.

We would also need to adjust the compose spec:

Yup; not sure how deprecating works there (ISTR the compose-spec may not deprecate things, but implementations are allowed to return an error if they don't support a feature if set, so ... maybe it's just docs; not 100% sure though!)

We should probably add a deprecation in;

Comment on lines +318 to +319
# gpu AND nvidia
- ["gpu", "nvidia"]
Member

As ["gpu", "nvidia"] (and with #49952) ["gpu", "amd"] will be the only remaining options, and both are a 1:1 match for driver=nvidia and driver=amd, I'm even wondering how dirty it would be to strip requiring these options at all, and just skip the selection by capabilities for both drivers;

moby/daemon/devices.go

Lines 59 to 61 in 9663b36

if selected := dd.capset.Match(req.Capabilities); selected != nil {
return dd.updateSpec(spec, &deviceInstance{req: req, selectedCaps: selected})
}

We would / could still support them if set, but simplify the default and only select by driver for those specific cases. 🤔
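
For readers unfamiliar with the matching step, a simplified sketch of capability-set matching follows. The `capSet` type and its `match` method are hypothetical and only approximate what `dd.capset.Match` is assumed to do: entries within one requested list are ANDed, and the lists themselves are ORed.

```go
package main

import "fmt"

// capSet is a hypothetical, simplified capability set for one driver.
type capSet map[string]struct{}

// match returns the first requested list whose entries are all present in
// the set (AND within a list, OR between lists), or nil if none qualifies.
func (s capSet) match(requested [][]string) []string {
	for _, list := range requested {
		all := true
		for _, c := range list {
			if _, ok := s[c]; !ok {
				all = false
				break
			}
		}
		if all {
			return list
		}
	}
	return nil
}

func main() {
	nvidia := capSet{"gpu": {}, "nvidia": {}}

	// ["gpu", "nvidia"] is fully covered by the nvidia driver's set.
	fmt.Println(nvidia.match([][]string{{"gpu", "nvidia"}}))

	// ["gpu", "amd"] is not, so the nvidia driver is not selected.
	fmt.Println(nvidia.match([][]string{{"gpu", "amd"}}) == nil)
}
```

With only ["gpu", "nvidia"] and ["gpu", "amd"] left, this match can never disagree with an explicitly requested driver, which is the argument for skipping it in that case.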

Contributor Author

@elezar elezar Jun 18, 2025

Is what you're proposing that we either select by driver OR by capabilities -- meaning that we would have something like:

func (daemon *Daemon) handleDevice(req container.DeviceRequest, spec *specs.Spec) error {
	if dd := deviceDrivers[req.Driver]; dd != nil {
		return dd.updateSpec(spec, &deviceInstance{req: req})
	}

	for _, dd := range deviceDrivers {
		if selected := dd.capset.Match(req.Capabilities); selected != nil {
			return dd.updateSpec(spec, &deviceInstance{req: req, selectedCaps: selected})
		}
	}
	return incompatibleDeviceRequest{req.Driver, req.Capabilities}
}

(possibly forwarding the requested capabilities when a driver matches)

I think this will be easier to reason about.

Note that the matching of drivers to capabilities is non-deterministic since we're iterating over a map. This was not a problem when we only had a single driver, but is an issue now.
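
One way to remove the non-determinism would be to iterate over driver names in sorted order. The following is a minimal sketch under that assumption; the `driver` type and `selectDriver` function are hypothetical and much simpler than the real deviceDrivers registry, and alphabetical order is an arbitrary but stable tie-breaker, not a proposed policy.

```go
package main

import (
	"fmt"
	"sort"
)

// driver is a hypothetical, reduced device driver with a capability set.
type driver struct {
	caps map[string]bool
}

// supports reports whether the driver provides every requested capability.
func (d driver) supports(requested []string) bool {
	for _, c := range requested {
		if !d.caps[c] {
			return false
		}
	}
	return true
}

// selectDriver prefers an explicitly named driver. Otherwise it falls back
// to capability matching, visiting drivers in sorted name order so that the
// result does not depend on Go's randomized map iteration.
func selectDriver(drivers map[string]driver, name string, requested []string) (string, bool) {
	if _, ok := drivers[name]; ok {
		return name, true
	}
	names := make([]string, 0, len(drivers))
	for n := range drivers {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		if drivers[n].supports(requested) {
			return n, true
		}
	}
	return "", false
}

func main() {
	drivers := map[string]driver{
		"nvidia": {caps: map[string]bool{"gpu": true, "nvidia": true}},
		"amd":    {caps: map[string]bool{"gpu": true, "amd": true}},
	}

	// Only "gpu" requested and no driver named: both drivers qualify, and
	// the sorted order decides which one is picked.
	name, ok := selectDriver(drivers, "", []string{"gpu"})
	fmt.Println(name, ok)

	// Explicit driver: capabilities are not consulted at all.
	name, ok = selectDriver(drivers, "nvidia", nil)
	fmt.Println(name, ok)
}
```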

Member

LOL have to recollect what my thinking was when I wrote, but I think that's roughly it yes;

Mostly considering;

if dd := deviceDrivers[req.Driver]; dd != nil {
	return dd.updateSpec(spec, &deviceInstance{req: req})
}

if we have no selectors other than gpu and (nvidia|amd), which both the nvidia and amd drivers would satisfy, would there be any situation where we would find an nvidia or amd driver that did not match those selectors / capabilities? (from my comment above, I don't think so)

So the only reason to match on capabilities would be if no driver is specified, in which case we'd be matching only based on capabilities (either "gpu" or "amd", or both). Specifying only "gpu" would indeed make it non-deterministic (but I think that's already the case?)

Contributor Author

Given the ambiguity, is removing support for capabilities entirely (even gpu & (nvidia | amd)) and only supporting the driver name an option? I'm not sure what the community thinks about this as a breaking change.

Contributor Author

I have added a commit on top that changes the selection behaviour. Let me know what you think.

elezar added 2 commits June 18, 2025 18:19
This change removes support for specifying NVIDIA_DRIVER_CAPABILITIES as
capabilities in the --gpus flag (or through docker compose).

The motivation for this is:
1. The list of capabilities is incomplete and has not been
   updated for new capabilities.
2. Handling of capabilities complicates the migration to CDI
3. The handling of capabilities on the --gpus flag conflicts
   with the NVIDIA_DRIVER_CAPABILITIES envvar that is specified
   in the image or on the command line.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ignores requested capabilities when a driver is explicitly
requested. This simplifies the logic for selecting a driver and means
that users need not specify redundant capabilities.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the remove-nvidia-gpu-capabilities-from-gpus-flag branch from 1d9f1c9 to c1257b1 Compare June 18, 2025 16:20
@thompson-shaun thompson-shaun modified the milestones: 29.0.0, 29.1.0 Aug 5, 2025
@elezar elezar marked this pull request as draft August 14, 2025 13:46
@elezar
Contributor Author

elezar commented Aug 14, 2025

I have moved the non-deprecation changes to #50717.

@thompson-shaun thompson-shaun modified the milestones: 29.1.0, 30.0.0 Aug 28, 2025