Added support for AMD GPUs in "docker run --gpus" #49952
thaJeztah merged 1 commit into moby:master from
Conversation
@sgopinath1 as a maintainer of the NVIDIA Container Toolkit and its components, I would strongly recommend against using the environment variable to control this behaviour -- even as an interim solution. Adding this behaviour now means that we have to keep it in mind when implementing a
@elezar a couple of points:
We would love to be able to use --gpus for AMD! BTW, the AMD Container Toolkit is now published: https://instinct.docs.amd.com/projects/container-toolkit/en/latest/container-runtime/overview.html
daemon/amd_linux.go
Outdated
```go
// countToDevicesAMD returns the list 0, 1, ... count-1 of deviceIDs.
func countToDevicesAMD(count int) string {
	devices := make([]string, count)
	for i := range devices {
		devices[i] = strconv.Itoa(i)
	}
	return strings.Join(devices, ",")
}
```
This is the same implementation as countToDevices in nvidia_linux.go. Does it make sense to just use that function?
Yes, makes sense. Changed accordingly.
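For reference, a minimal sketch of what sharing the helper looks like. Only the name `countToDevices` comes from the discussion above; the `package main` scaffolding around it is illustrative, not the daemon's actual file layout:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// countToDevices returns the comma-separated list "0,1,...,count-1" of
// device IDs. Both the NVIDIA and AMD code paths can call this single
// helper instead of keeping per-vendor copies.
func countToDevices(count int) string {
	devices := make([]string, count)
	for i := range devices {
		devices[i] = strconv.Itoa(i)
	}
	return strings.Join(devices, ",")
}

func main() {
	fmt.Println(countToDevices(3)) // prints "0,1,2"
}
```

Because the function is vendor-agnostic (it only formats indices), keeping one copy avoids the two implementations drifting apart.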
daemon/amd_linux.go
Outdated
```go
const amdContainerRuntime = "amd-container-runtime"

func init() {
```
@sgopinath1 instead of having a separate init function for nvidia and amd GPUs, does it make sense to refactor this and the code in nvidia_linux.go to have a single init function that checks for the existence of the various executables and registers the drivers accordingly?
Agreed. I have modified the init function in nvidia_linux.go to register the AMD driver also.
@elezar thanks for reviewing. I have updated the code as per your suggestions. Please review.
daemon/nvidia_linux.go
Outdated
```diff
-const nvidiaHook = "nvidia-container-runtime-hook"
+const (
+	nvidiaHook = "nvidia-container-runtime-hook"
+	amdHook    = "amd-container-runtime"
```
```diff
-	amdHook = "amd-container-runtime"
+	amdContainerRuntimeExecutableName = "amd-container-runtime"
```
Done as suggested.
@elezar Let me know if there are any further comments on the changes. Thanks.
LGTM |
Could you do a quick rebase and squash the commits?
Done. |
daemon/nvidia_linux.go
Outdated
```go
} else {
	// no "gpu" capability
}
```
Hm.. looks like the linter doesn't like this (even with a comment inside the branch 🤔)
```
daemon/nvidia_linux.go:55:9: SA9003: empty branch (staticcheck)
	} else {
	^
```
```go
if _, err := exec.LookPath(nvidiaHook); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
	nvidiaDriver := &deviceDriver{
		capset:     capset,
		updateSpec: setNvidiaGPUs,
	}
	for c := range allNvidiaCaps {
		nvidiaDriver.capset[string(c)] = struct{}{}
	}
	registerDeviceDriver("nvidia", nvidiaDriver)
} else if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
	capset := capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}}
	amdDriver := &deviceDriver{
		capset:     capset,
		updateSpec: setAMDGPUs,
	}
	registerDeviceDriver("amd", amdDriver)
} else {
	// no "gpu" capability
}
```
Perhaps an early return would work;
```diff
-	if _, err := exec.LookPath(nvidiaHook); err == nil {
-		capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
-		nvidiaDriver := &deviceDriver{
-			capset:     capset,
-			updateSpec: setNvidiaGPUs,
-		}
-		for c := range allNvidiaCaps {
-			nvidiaDriver.capset[string(c)] = struct{}{}
-		}
-		registerDeviceDriver("nvidia", nvidiaDriver)
-	} else if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
-		capset := capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}}
-		amdDriver := &deviceDriver{
-			capset:     capset,
-			updateSpec: setAMDGPUs,
-		}
-		registerDeviceDriver("amd", amdDriver)
-	} else {
-		// no "gpu" capability
-	}
+	if _, err := exec.LookPath(nvidiaHook); err == nil {
+		capset := capabilities.Set{"gpu": struct{}{}, "nvidia": struct{}{}}
+		for c := range allNvidiaCaps {
+			capset[string(c)] = struct{}{}
+		}
+		registerDeviceDriver("nvidia", &deviceDriver{
+			capset:     capset,
+			updateSpec: setNvidiaGPUs,
+		})
+		return
+	}
+	if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
+		registerDeviceDriver("amd", &deviceDriver{
+			capset:     capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}},
+			updateSpec: setAMDGPUs,
+		})
+		return
+	}
+	// no "gpu" capability
```
Curious though; should amd and nvidia be considered mutually exclusive? Would splitting this into two init funcs (once in the nvidia file, one in the amd file) and both register a driver (if present) work?
They are mutually exclusive, but I suggested that the same init function be used because @sgopinath1 was also checking for the NVIDIA runtime in the AMD init function. The intent was to ensure that the AMD runtime is not used if the NVIDIA hook is present and that the nvidia logic takes precedence.
It may be cleaner to repeat some code here to keep these logically separate.
> Perhaps an early return would work;
@thaJeztah Changed as suggested.
@thaJeztah Let me know if there are any further comments or the changes look good. Thanks.
It's probably good to rename it as part of this PR to have them both follow the same naming pattern; git should be smart enough to handle the rename if the other PR is rebased. Can you do the above, and squash the commits?
Hm... actually, I just realised that effectively the related selection code is in
Added backend code to support the exact same interface used today for Nvidia GPUs, allowing customers to use the same docker commands for both Nvidia and AMD GPUs.

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>

Reused common functions from nvidia_linux.go. Removed duplicate code in amd_linux.go by reusing the init() and countToDevices() functions in nvidia_linux.go. AMD driver is registered in init().

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>

Renamed amd-container-runtime constant.

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>

Removed empty branch to keep linter happy. Also renamed amd_linux.go to gpu_amd_linux.go.

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>

Renamed nvidia_linux.go and gpu_amd_linux.go.

Signed-off-by: Sudheendra Gopinath <sudheendra.gopinath@amd.com>
@thaJeztah I have renamed the files as above and squashed the commits.
```go
// Register AMD driver if AMD helper binary is present.
if _, err := exec.LookPath(amdContainerRuntimeExecutableName); err == nil {
	registerDeviceDriver("amd", &deviceDriver{
		capset:     capabilities.Set{"gpu": struct{}{}, "amd": struct{}{}},
		updateSpec: setAMDGPUs,
	})
	return
```
I think it's okay-ish for now, but since this already needs the AMD container toolkit to be installed... what I'd really love is:

- Shell out to `amd-ctk cdi generate` to generate the actual CDI configs (on init, and then perhaps on every `gpu` request?)
- And then just rewrite the `gpus` value to a proper CDI device ID and pass it to the CDI driver

This way, the user won't need to pass `--runtime=amd` to override the container runtime to the AMD runc wrapper.
I "hijacked" the original ticket to re-purpose it for the discussion on reimplementing
thaJeztah left a comment
still LGTM
let's bring this one in
This change adds support for AMD GPUs in the `docker run --gpus` command.

Related: reimplement the `--gpus` flag using CDI (was "AMD GPU support") #49824.

- What I did
Added backend code to support the exact same interface used today for Nvidia GPUs, allowing customers to use the same docker commands for both Nvidia and AMD GPUs.
- How I did it

Registered a device driver for the `gpu` capability. The `--gpus` input in the docker command is translated to an environment variable, `AMD_VISIBLE_DEVICES`, that is handled by the AMD container runtime.

- How to verify it
AMD container runtime must be installed on the system to verify this functionality. AMD container runtime is expected to be published as an open-source project soon.
The following commands can be used to specify which GPUs are required inside the container, and `rocm-smi` output can be used to verify the correct GPUs are made available inside the container.

To use all available GPUs:

    docker run --runtime=amd --gpus all rocm/rocm-terminal rocm-smi

OR

    docker run --runtime=amd --gpus device=all rocm/rocm-terminal rocm-smi

To use any 2 GPUs:

    docker run --runtime=amd --gpus 2 rocm/rocm-terminal rocm-smi

To use a set of specific GPUs:

    docker run --runtime=amd --gpus 1,2,3 rocm/rocm-terminal rocm-smi

OR

    docker run --runtime=amd --gpus '"device=1,2,3"' rocm/rocm-terminal rocm-smi

- Human readable description for the release notes
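The `--gpus` → `AMD_VISIBLE_DEVICES` translation described above can be sketched as follows. Only the names `setAMDGPUs` and `AMD_VISIBLE_DEVICES` appear in this PR; the `deviceRequest`/`ociSpec` stand-in types are simplified placeholders for the daemon's real types, not the merged implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Minimal stand-ins for the daemon's device-request and OCI-spec types.
type deviceRequest struct {
	Count     int      // -1 means "all"
	DeviceIDs []string // explicit IDs, e.g. from --gpus device=1,2,3
}

type ociSpec struct {
	Env []string
}

// setAMDGPUs translates a --gpus request into the AMD_VISIBLE_DEVICES
// environment variable consumed by the AMD container runtime.
func setAMDGPUs(spec *ociSpec, req deviceRequest) {
	var value string
	switch {
	case len(req.DeviceIDs) > 0:
		value = strings.Join(req.DeviceIDs, ",")
	case req.Count < 0:
		value = "all"
	default:
		ids := make([]string, req.Count)
		for i := range ids {
			ids[i] = fmt.Sprintf("%d", i)
		}
		value = strings.Join(ids, ",")
	}
	spec.Env = append(spec.Env, "AMD_VISIBLE_DEVICES="+value)
}

func main() {
	spec := &ociSpec{}
	setAMDGPUs(spec, deviceRequest{DeviceIDs: []string{"1", "2", "3"}})
	fmt.Println(spec.Env[0])
}
```

This mirrors how the NVIDIA path drives its runtime via `NVIDIA_VISIBLE_DEVICES`, which is why the same `docker run --gpus` front-end works for both vendors.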