Remove support for specifying nvidia driver capabilities in --gpus flag#50099
Remove support for specifying nvidia driver capabilities in --gpus flag#50099elezar wants to merge 2 commits intomoby:masterfrom
Conversation
|
cc @thaJeztah |
|
We would also need to remove the reference here: https://docs.docker.com/engine/containers/resource_constraints/#set-nvidia-capabilities |
|
We would also need to adjust the compose spec: |
|
Curious; would these options be something we should produce an error for when set, or (I guess they're a bit of an niche case) is it safe to silently ignore if they're used? Alternatively, something in the middle would be to return a warning; moby/api/server/router/container/container_routes.go Lines 665 to 668 in 9663b36 moby/api/server/router/container/container_routes.go Lines 706 to 707 in 9663b36
Yes, also the compose docs, looks like; IIUC, those examples could be updated by setting the env-var manually instead, correct? If that's the case, and if that already works, perhaps would be good to already update the examples to match that.
Yup; not sure how deprecating works there (ISTR the compose-spec may not deprecate things, but implementations are allowed to return an error if they don't support a feature if set, so ... maybe it's just docs; not 100% sure though!) We should probably a deprecation in;
|
| # gpu AND nvidia | ||
| - ["gpu", "nvidia"] |
There was a problem hiding this comment.
As ["gpu", "nvidia"] (and with #49952) ["gpu", "amd"] will be the only remaining options, and both are a 1:1 match for driver=nvidia and driver=amd, I'm even wondering how dirty it would be to strip requiring these options at all, and just skip the selection by capabilities for both drivers;
Lines 59 to 61 in 9663b36
We would / could still support them if set, but simplify the default and only select by driver for those specific cases. 🤔
There was a problem hiding this comment.
Is what you're proposing that we either select by driver OR by capabilities -- meaning that we would have something like:
func (daemon *Daemon) handleDevice(req container.DeviceRequest, spec *specs.Spec) error {
if dd := deviceDrivers[req.Driver]; dd != nil {
return dd.updateSpec(spec, &deviceInstance{req: req})
}
for _, dd := range deviceDrivers {
if selected := dd.capset.Match(req.Capabilities); selected != nil {
return dd.updateSpec(spec, &deviceInstance{req: req, selectedCaps: selected})
}
}
return incompatibleDeviceRequest{req.Driver, req.Capabilities}
}(possibly forwarding the requested capabilities when a driver matches)
I think this will be easier to reason about.
Note that the matching of drivers to capabilities is non-deterministic since we're we're iterating over a map. This was not a problem when we only had a single driver, but is an issue now.
There was a problem hiding this comment.
LOL have to recollect what my thinking was when I wrote, but I think that's roughly it yes;
Mostly considering;
if dd := deviceDrivers[req.Driver]; dd != nil {
return dd.updateSpec(spec, &deviceInstance{req: req})
}if we have no selectors other than gpu and (nvidia|amd), which both the nvidia and amd drivers would satisfy, would there be any situation where we would find a nvidia or amd driver that did not match those selectors / capabilities (from my comment above, I don't think so)
So the only reason to match on capabilities would be if no driver is specified, in which case we'd be matching only based on capabilities (either "gpu" or "amd", or both). Specifying only "gpu" would indeed make it non-deterministic (but I think that's already the case?)
There was a problem hiding this comment.
Given the ambiguity, is removing support for capabilities entirely (even gpu & (nvidia | amd)) and only supporting the driver name an option? I'm not sure what the community things about this as a breaking change.
There was a problem hiding this comment.
I have added a commit on top that changes the selection behaviour. Let me know what you think.
@vvoland I have created compose-spec/compose-go#798 what do you think? |
This change removes support for specifying NVIDIA_DRIVER_CAPABILITIES as capabilities in the --gpus flag (or through docker compose). The motivation for this is: 1. The list of capabilities is incomplete and has not been updated for new capabilities. 2. Handling of capabilities complicats the migration to CDI 3. The handling of capabilities on the --gpus flag conflicts with the NVIDIA_DRIVER_CAPABILITES envvar that is specified in the image or on the command line. Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ignores requested capabilities when a driver is explicitly requested. This simplifies the logic for selecting a driver and means that users need not spefify redundant capabilities. Signed-off-by: Evan Lezar <elezar@nvidia.com>
1d9f1c9 to
c1257b1
Compare
|
I have moved the non-deprecation changes to #50717. |
- What I did
Updated the code that handles the injection of the
nvidia-container-runtime-hookto not handle NVIDIA-specific capabilities instead relying on theNVIDIA_DRIVER_CAPABILITIESenvironment variable.The motivation for this is:
See also:
--gpusflag using CDI (was "AMD GPU support") #49824- How I did it
Removed references to and processing of NVIDIA-specific capabilities.
- How to verify it
Running:
should show:
- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)