Use CDI for GPU injection for AMD devices for --gpus#52048
Use CDI for GPU injection for AMD devices for --gpus#52048vvoland merged 2 commits intomoby:masterfrom
Conversation
daemon/devices_amd_linux.go
Outdated
|
|
||
| // Try to detect AMD GPU vendor via CDI cache if cdiCache is available | ||
| if cdiCache != nil { | ||
| vendor, err := discoverGPUVendorFromCDI(cdiCache) |
There was a problem hiding this comment.
One thing about this approach ... this only checks whether the cache includes AMD cdi devices at the point where the daemon is reloaded. In contrast to the other projects where we have added this functionlity, the cache here is started with AutoRefresh enabled meaning that the CDI spec directories are watched for changes to ensure that specs for new devices are detected.
With that in mind, the drivers that one wants to register would have to determine the vendor from the cache for every --gpus request and not only once at startup.
There was a problem hiding this comment.
Yes, makes sense.
I have updated the logic to always discover vendor from CDI registry on the fly since the registry is auto refreshed. I have also verified it is working as expected by deleting the CDI files while the daemon is running and verifying that the vendor discovery fails in that case.
Thanks for the suggestion.
f0f8245 to
faba468
Compare
faba468 to
5ec493c
Compare
|
Hey @vvoland I have tried answering your comments. Can you please review again and let me know if you still see any issues? I will be more than happy to improve. Thanks! |
|
Code looks good, but the CI is failing on linting issues |
5ec493c to
735a141
Compare
|
Failure on Windows: |
|
Looks like this is because Would you recommend to create a corresponding _windows.go (or _unsupported.go) file and add a dummy implementation there? |
735a141 to
bc4c3f2
Compare
|
@vvoland I have added a no-op implementation of that function in devices_unsupported.go. |
| import "tags.cncf.io/container-device-interface/pkg/cdi" | ||
|
|
||
| // RegisterGPUDeviceDrivers is a no-op on non-Linux platforms. | ||
| func RegisterGPUDeviceDrivers(_ *cdi.Cache) {} |
There was a problem hiding this comment.
nit: It would be a bit cleaner to have the devices_linux.go file with the RegisterGPUDeviceDrivers implementation so we avoid the empty stub.
There was a problem hiding this comment.
I have created device_linux.go which overrides RegisterGPUDeviceDrivers with registerGPUDeviceDrivers which is linux specific implementation. This avoids the empty stub.
There was a problem hiding this comment.
Oh, sorry I missed that RegisterGPUDeviceDrivers is actually called from a non "linux" file, so your previous was actually correct 🤦🏻
I can fix that in a follow up PR, so we don't need to block this one.
There was a problem hiding this comment.
Sure. I was also a bit confused... I assumed you are trying to avoid an _unsupported.go file and looking for a _linux.go somehow.
There was a problem hiding this comment.
I'm going to wait for Evan to give a look too so I pushed 13a8626 to address this.
Feel free to squash it with your PR if you prefer!
Signed-off-by: Shiv Tyagi <Shiv.Tyagi@amd.com>
bc4c3f2 to
561a5a9
Compare
|
I don't think the test failure is related. Please let me know if it is. Thank you so much @vvoland for all your time and efforts in reviewing this. |
|
@elezar want to give this one a last look? |
I just finished off for the day. I'll have a look tomorrow first thing! |
Change GPU device driver registration from init-time function assignment to direct function exports. This removes the init() function and global function variable pattern in favor of a more explicit approach. Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
|
@elezar I'll get this in to unblock 29.3.0. Feel free to still give this one a look though! |
Thanks. I did have some thoughts, but was working on formulating them more clearly. I'll either comment here, or create a PR with some minor changes proposed. |
| cdiCache = daemon.RegisterCDIDriver(cli.Config.CDISpecDirs...) | ||
| } | ||
|
|
||
| daemon.RegisterGPUDeviceDrivers(cdiCache) |
There was a problem hiding this comment.
As a general note, I think doing this explicitly instead of relying on init makes things cleaner.
There was a problem hiding this comment.
Hey @elezar! This was needed to ensure we use the same cdiCache which is initialized above before this call.
Sorry I am not following. We are not using init() here. Can you explain more?
There was a problem hiding this comment.
Yes. I was just saying that I think this change is for the better. Thanks for this.
| return fmt.Errorf("failed to discover GPU vendor from CDI: %w", err) | ||
| } | ||
|
|
||
| if vendor != "amd.com" { |
There was a problem hiding this comment.
Thinking as one f the maintainers of the CDI tooling, adding a ListKinds function to a cdi.Cache will allow us to be more accurate with this check. There may be OTHER kinds (e.g. amd.com/not.gpu) that would cause the vendor detection to pass, but the amd.com/gpu kind could still be unknown.
Since we fall throug to the runtime-based update in this case anyway, I don't think it's too critical to address though.
There was a problem hiding this comment.
I agree. We can add that function and use that is all CLIs for --gpus.
|
I was a bit delayed in getting to this. I have no blocking comments, but did create #52145 as a follow-up. |
Closes #49824
This PR enhances the functionality of the
--gpusoption for AMD GPUs by utilizing CDI (Container Device Interface) specs for device injection when available. It falls back to the existing vendor runtime-based injection if AMD CDI specs are not detected on the machine.Related PR: containerd/containerd#12839 (Similar implementation for
containerd/ctr)- What I did
Added support for CDI-based GPU device injection through
--gpusoption for AMD devices.- How I did it
Created a similar composite device driver like NVIDIA's which discovers if AMD's CDI specs are there on the system during registration and registers itself with appropriate updaters to handle the device request.
- How to verify it
make binary.dockerdinstance via./bundles/binary/dockerd.amd-ctk cdi generateto install the CDI specs on the host.docker run --rm --gpus all rocm/rocm-terminal rocm-smi.dockerdprocess.docker run --rm --runtime="amd" --gpus all rocm/rocm-terminal rocm-smi.AMD_VISIBLE_DEVICESset when CDI specs are not there to verify that the fallback is working correctly.I have also added unit tests for vendor discovery function.
- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)