Skip to content

Use CDI for GPU injection for AMD devices for --gpus#52048

Merged
vvoland merged 2 commits intomoby:masterfrom
shiv-tyagi:vendor-detection
Mar 5, 2026
Merged

Use CDI for GPU injection for AMD devices for --gpus#52048
vvoland merged 2 commits intomoby:masterfrom
shiv-tyagi:vendor-detection

Conversation

@shiv-tyagi
Copy link
Contributor

@shiv-tyagi shiv-tyagi commented Feb 16, 2026

Closes #49824

This PR enhances the functionality of the --gpus option for AMD GPUs by utilizing CDI (Container Device Interface) specs for device injection when available. It falls back to the existing vendor runtime-based injection if AMD CDI specs are not detected on the machine.

Related PR: containerd/containerd#12839 (Similar implementation for containerd/ctr)

- What I did
Added support for CDI-based GPU device injection through --gpus option for AMD devices.

- How I did it
Created a similar composite device driver like NVIDIA's which discovers if AMD's CDI specs are there on the system during registration and registers itself with appropriate updaters to handle the device request.

- How to verify it

  1. Built the binaries using make binary.
  2. Started the newly built dockerd instance via ./bundles/binary/dockerd.
  3. Used amd-ctk cdi generate to install the CDI specs on the host.
  4. Test Injection: Ran docker run --rm --gpus all rocm/rocm-terminal rocm-smi.
    • Verified that CDI-based GPU injection works as expected.
  5. Test (Fallback Path):
    • Deleted the CDI specs and restarted the dockerd process.
    • Retried using the runtime flag: docker run --rm --runtime="amd" --gpus all rocm/rocm-terminal rocm-smi.
    • Verified that the vendor runtime still works as a fallback when CDI specs are absent.
    • Checked the environment variable inside the container has AMD_VISIBLE_DEVICES set when CDI specs are not there to verify that the fallback is working correctly.

I have also added unit tests for vendor discovery function.

- Human readable description for the release notes

The `--gpus` option now uses CDI-based injection for AMD GPUs.

- A picture of a cute animal (not mandatory but encouraged)

@github-actions github-actions bot added the area/daemon Core Engine label Feb 16, 2026

// Try to detect AMD GPU vendor via CDI cache if cdiCache is available
if cdiCache != nil {
vendor, err := discoverGPUVendorFromCDI(cdiCache)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing about this approach ... this only checks whether the cache includes AMD cdi devices at the point where the daemon is reloaded. In contrast to the other projects where we have added this functionlity, the cache here is started with AutoRefresh enabled meaning that the CDI spec directories are watched for changes to ensure that specs for new devices are detected.

With that in mind, the drivers that one wants to register would have to determine the vendor from the cache for every --gpus request and not only once at startup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense.

I have updated the logic to always discover vendor from CDI registry on the fly since the registry is auto refreshed. I have also verified it is working as expected by deleting the CDI files while the daemon is running and verifying that the vendor discovery fails in that case.

Thanks for the suggestion.

@vvoland vvoland added kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/cdi labels Feb 16, 2026
@vvoland vvoland added this to the 29.3.0 milestone Feb 16, 2026
@shiv-tyagi
Copy link
Contributor Author

Hey @vvoland I have tried answering your comments. Can you please review again and let me know if you still see any issues? I will be more than happy to improve. Thanks!

@vvoland
Copy link
Contributor

vvoland commented Mar 4, 2026

Code looks good, but the CI is failing on linting issues

@vvoland
Copy link
Contributor

vvoland commented Mar 4, 2026

Failure on Windows:

0.750 cmd/dockerd/main_windows.go
0.773 Building static /tmp/bundles/binary-daemon/dockerd.exe (windows/amd64)...
79.23 # github.com/moby/moby/v2/daemon/command
79.23 daemon/command/daemon.go:288:9: undefined: daemon.RegisterGPUDeviceDrivers

@shiv-tyagi
Copy link
Contributor Author

shiv-tyagi commented Mar 4, 2026

Looks like this is because RegisterGPUDeviceDrivers() is defined only inside devices_nvidia_linux.go?

Would you recommend to create a corresponding _windows.go (or _unsupported.go) file and add a dummy implementation there?

@shiv-tyagi
Copy link
Contributor Author

@vvoland I have added a no-op implementation of that function in devices_unsupported.go. make win runs fine for me locally now. Please check and let me know if that is not the correct way of doing it. Thanks.

import "tags.cncf.io/container-device-interface/pkg/cdi"

// RegisterGPUDeviceDrivers is a no-op on non-Linux platforms.
func RegisterGPUDeviceDrivers(_ *cdi.Cache) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It would be a bit cleaner to have the devices_linux.go file with the RegisterGPUDeviceDrivers implementation so we avoid the empty stub.

Copy link
Contributor Author

@shiv-tyagi shiv-tyagi Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created device_linux.go which overrides RegisterGPUDeviceDrivers with registerGPUDeviceDrivers which is linux specific implementation. This avoids the empty stub.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry I missed that RegisterGPUDeviceDrivers is actually called from a non "linux" file, so your previous was actually correct 🤦🏻

I can fix that in a follow up PR, so we don't need to block this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I was also a bit confused... I assumed you are trying to avoid an _unsupported.go file and looking for a _linux.go somehow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to wait for Evan to give a look too so I pushed 13a8626 to address this.

Feel free to squash it with your PR if you prefer!

Signed-off-by: Shiv Tyagi <Shiv.Tyagi@amd.com>
@shiv-tyagi
Copy link
Contributor Author

I don't think the test failure is related. Please let me know if it is.

Thank you so much @vvoland for all your time and efforts in reviewing this.

Copy link
Contributor

@vvoland vvoland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@vvoland
Copy link
Contributor

vvoland commented Mar 4, 2026

@elezar want to give this one a last look?

@elezar
Copy link
Contributor

elezar commented Mar 4, 2026

@elezar want to give this one a last look?

I just finished off for the day. I'll have a look tomorrow first thing!

Change GPU device driver registration from init-time function assignment
to direct function exports.

This removes the init() function and global function variable pattern in
favor of a more explicit approach.

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
@vvoland
Copy link
Contributor

vvoland commented Mar 5, 2026

@elezar I'll get this in to unblock 29.3.0. Feel free to still give this one a look though!

@vvoland vvoland merged commit 4c19a01 into moby:master Mar 5, 2026
252 of 253 checks passed
@elezar
Copy link
Contributor

elezar commented Mar 5, 2026

@elezar I'll get this in to unblock 29.3.0. Feel free to still give this one a look though!

Thanks. I did have some thoughts, but was working on formulating them more clearly. I'll either comment here, or create a PR with some minor changes proposed.

cdiCache = daemon.RegisterCDIDriver(cli.Config.CDISpecDirs...)
}

daemon.RegisterGPUDeviceDrivers(cdiCache)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general note, I think doing this explicitly instead of relying on init makes things cleaner.

Copy link
Contributor Author

@shiv-tyagi shiv-tyagi Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @elezar! This was needed to ensure we use the same cdiCache which is initialized above before this call.

Sorry I am not following. We are not using init() here. Can you explain more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I was just saying that I think this change is for the better. Thanks for this.

return fmt.Errorf("failed to discover GPU vendor from CDI: %w", err)
}

if vendor != "amd.com" {
Copy link
Contributor

@elezar elezar Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking as one f the maintainers of the CDI tooling, adding a ListKinds function to a cdi.Cache will allow us to be more accurate with this check. There may be OTHER kinds (e.g. amd.com/not.gpu) that would cause the vendor detection to pass, but the amd.com/gpu kind could still be unknown.

Since we fall throug to the runtime-based update in this case anyway, I don't think it's too critical to address though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. We can add that function and use that is all CLIs for --gpus.

@elezar
Copy link
Contributor

elezar commented Mar 5, 2026

I was a bit delayed in getting to this. I have no blocking comments, but did create #52145 as a follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cdi area/daemon Core Engine area/testing impact/changelog kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

re-implement --gpus flag using CDI (was "AMD GPU support")

3 participants